
arxiv: 2604.27372 · v1 · submitted 2026-04-30 · 🧮 math.OC · cs.LG · cs.MA

Continuous-time q-learning for mean-field control with common noise, part-I: Theoretical foundations

Pith reviewed 2026-05-07 09:11 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.MA
keywords mean-field control · q-function · entropy regularization · common noise · policy iteration · Hamilton-Jacobi-Bellman equation · linear-quadratic control

The pith

Under a concavity condition, the optimal policy for entropy-regularized mean-field control with common noise is identified as a two-layer fixed point of the argmax operator on the integrated q-function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the continuous-time q-function for entropy-regularized mean-field control problems that include controlled common noise. It first establishes that the value function from discretely sampled actions converges to the relaxed-control value function as the time grid is refined. This convergence permits derivation of an exploratory Hamilton-Jacobi-Bellman equation whose extra nonlinear term arises from the controlled common noise. Under a stated concavity condition, the paper proves existence and uniqueness of the optimal one-step policy iteration by means of a first-order condition that uses the partial linear functional derivative with respect to the policy. It then defines the integrated q-function on the joint space of state distributions and policies and shows that optimal policies are two-layer fixed points of the associated argmax operator, with an explicit Gaussian form obtained in the linear-quadratic case.
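
For orientation, the first-order condition mentioned above takes a familiar form in a simpler setting: a single agent, no common noise, and a one-step objective that is linear in the policy apart from the entropy term. The following is a sketch under that simplifying assumption only; the symbols $q$, $\gamma$ and $\lambda$ are illustrative, and the paper's own condition carries an additional nonlinear functional of the policy that is not reproduced here.

```latex
% Assumed simplified objective over densities \pi on the action space A:
%   J(\pi) = \int_A q(a)\,\pi(a)\,da - \gamma \int_A \pi(a)\log\pi(a)\,da, \qquad \gamma > 0.
% \lambda is the Lagrange multiplier for the constraint \int_A \pi(a)\,da = 1.
\[
\frac{\delta}{\delta \pi}\Big[ J(\pi) - \lambda\Big(\int_A \pi(a)\,da - 1\Big)\Big](a)
  = q(a) - \gamma\big(\log \pi(a) + 1\big) - \lambda = 0
\quad\Longrightarrow\quad
\pi^*(a) = \frac{\exp\big(q(a)/\gamma\big)}{\int_A \exp\big(q(a')/\gamma\big)\,da'} .
\]
```

In this simplified case the objective is strictly concave in $\pi$, so the stationary point is the unique maximizer whenever the normalizing integral is finite.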

Core claim

We investigate the continuous-time counterpart of the q-function for entropy-regularized mean-field control with controlled common noise. We show that, under discretely sampled actions, the value function in the exploratory formulation converges to the one in the relaxed control formulation as the time grid refines. Leveraging the relaxed control formulation, we derive the exploratory Hamilton-Jacobi-Bellman equation, in which the controlled common noise gives rise to an additional nonlinear functional of policy. Under a concavity condition, we establish the existence and uniqueness of the optimal one-step policy iteration via a first-order condition using the partial linear functional derivative with respect to policy.

What carries the argument

The integrated q-function (Iq-function) defined on the state distribution and the policy, which identifies an optimal policy as a two-layer fixed point of its argmax operator.

If this is right

  • The value function under discrete action sampling converges to the relaxed-control value as the discretization refines.
  • The exploratory HJB equation incorporates an extra nonlinear functional of policy induced by controlled common noise.
  • Policy improvement at each iteration corresponds to an entropy-regularized optimization problem over the space of policies.
  • In the general linear-quadratic setting the optimal policy is explicitly a Gaussian distribution.
  • Optimal policies satisfy the two-layer fixed-point relation with respect to the argmax of the Iq-function.
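
To make the Gaussian claim in the list above concrete, here is a minimal numerical sketch of the generic mechanism behind it: when a q-function is quadratic and concave in the action, the entropy-regularized density proportional to exp(q/γ) is Gaussian. The coefficients `r`, `b` and the temperature `gamma` are hypothetical, and the scalar, state-free q-function is an assumption made for illustration; it is not the paper's LQ model.

```python
import numpy as np

def gibbs_density(q_values, grid, gamma):
    """Normalize exp(q / gamma) on a uniform grid using a Riemann sum."""
    logits = q_values / gamma
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    dx = grid[1] - grid[0]
    return weights / (weights.sum() * dx)

# Hypothetical quadratic q-function: q(a) = -0.5 * r * a**2 + b * a, with r > 0.
r, b, gamma = 2.0, 1.0, 0.5
grid = np.linspace(-10.0, 10.0, 20001)
q_values = -0.5 * r * grid**2 + b * grid

density = gibbs_density(q_values, grid, gamma)
dx = grid[1] - grid[0]
mean = float((grid * density).sum() * dx)
variance = float(((grid - mean) ** 2 * density).sum() * dx)

# Completing the square predicts a Gaussian with mean b / r and variance gamma / r.
print("mean:", mean, "predicted:", b / r)
print("variance:", variance, "predicted:", gamma / r)
```

The check only confirms the completing-the-square identity for a fixed quadratic; in the paper's setting the coefficients would depend on the state distribution and on the solution of the exploratory HJB equation.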

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-layer fixed-point structure suggests an alternating numerical scheme that iterates between updating the policy and the state distribution.
  • The explicit Gaussian solution in the LQ case supplies an exact benchmark against which to test approximation methods for non-LQ problems.
  • The convergence result between discrete sampling and relaxed controls opens a route to simulation-based algorithms that discretize time while preserving the continuous-time limit.
  • Verification of the concavity condition in concrete applications such as portfolio optimization or traffic networks would determine the range of practical use.
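
The intuition behind the discrete-sampling convergence mentioned in the list above can be illustrated with a toy experiment. The sketch below assumes a drift that is linear in the action and a fixed Gaussian policy N(mu, s²) that does not depend on the state; it has no mean-field interaction and no common noise, so it shows only the averaging effect, not the paper's result. All names and parameter values are hypothetical.

```python
import numpy as np

def integrated_action(n_steps, n_paths, mu, s, horizon, rng):
    """Integrate a piecewise-constant action path with one Gaussian draw per grid cell."""
    h = horizon / n_steps
    actions = rng.normal(mu, s, size=(n_paths, n_steps))
    return actions.sum(axis=1) * h

rng = np.random.default_rng(0)
mu, s, horizon, n_paths = 1.0, 2.0, 1.0, 2000
relaxed_value = mu * horizon  # drift integrated against the policy mean

results = {}
for n_steps in (10, 100, 1000):
    samples = integrated_action(n_steps, n_paths, mu, s, horizon, rng)
    results[n_steps] = (float(samples.mean()), float(samples.var()))
    theory_var = s**2 * horizon**2 / n_steps  # variance of a sum of n i.i.d. draws times h
    print(n_steps, results[n_steps], "theoretical variance:", theory_var)
```

As the grid is refined, the sampled drift concentrates around `relaxed_value` with variance of order 1/n, which is the law-of-large-numbers effect that makes the relaxed formulation a natural limit in this toy case.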

Load-bearing premise

The concavity condition invoked to guarantee existence and uniqueness of the optimal one-step policy iteration.

What would settle it

A concrete mean-field control instance in which the concavity condition is violated yet the one-step policy iteration still admits a unique solution, or conversely admits multiple solutions.

Figures

Figures reproduced from arXiv: 2604.27372 by Xiang Yu, Xiaoli Wei, Xun Yu Zhou, Zhenjie Ren.

Figure 1: Conceptual relationship to the literature.
Figure 2: Illustration of the two-layer fixed point (4.2).

Remark 4.3. We could search for the optimal policy by considering (4.2) as a two-layer fixed-point problem. Specifically, starting with an initial policy π^0, at each iteration n ∈ ℕ, we derive π^{n+1} by looking for the fixed point of the map Φ_{π^n}, which is the inner layer of the two-layer fixed-point problem. Recall that I is a map from π^n to π^{n+1}, that is, π …
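
The iteration described in Remark 4.3 has the shape of a nested Picard iteration: an inner fixed-point solve for each outer update. The sketch below shows that shape on scalar toy maps. The functions `inner_map` and `outer_map` are hypothetical stand-ins for Φ and I, chosen as contractions so that both layers converge; nothing here reproduces the paper's operators.

```python
def fixed_point(mapping, start, tol=1e-12, max_iter=10_000):
    """Picard iteration x <- mapping(x) until successive iterates differ by less than tol."""
    x = start
    for _ in range(max_iter):
        x_next = mapping(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

def inner_map(policy_param):
    """Toy stand-in for Phi_{pi^n}: a contraction in x that depends on the current policy."""
    return lambda x: 0.3 * policy_param + 0.4 * x + 0.6

def outer_map(policy_param):
    """Toy stand-in for I: send pi^n to the fixed point of the inner map."""
    return fixed_point(inner_map(policy_param), start=policy_param)

# Outer layer: iterate pi^{n+1} = I(pi^n) from an arbitrary initial guess.
policy_star = fixed_point(outer_map, start=0.0, tol=1e-10)
print(policy_star)
```

For these toy maps the inner fixed point is 0.5 · π + 1, so the outer iteration settles at 2. In practice, whether either layer is a contraction would have to be established for the actual operators.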
Original abstract

This paper investigates the continuous-time counterpart of the Q-function for entropy-regularized mean-field control (MFC) with controlled common noise, coined as q-function by Jia and Zhou (2023) in the single agent's model. We first show that, under discretely sampled actions, the value function in the exploratory formulation converges to the one in the relaxed control formulation as the time grid refines. Leveraging the relaxed control formulation, we derive the exploratory Hamilton-Jacobi-Bellman (HJB) equation, in which the controlled common noise gives rise to an additional nonlinear functional of policy, rendering the policy iteration intricate. Under certain concavity condition, we establish the existence and uniqueness of the optimal one-step policy iteration via a first-order condition using the partial linear functional derivative with respect to policy. The policy improvement at each iteration is verified by relating to an entropy-regularized optimization problem over the space of policies. In the mean-field setting, we introduce the integrated q-function (Iq-function) defined on the state distribution and the policy, and it is shown that an optimal policy is identified as a two-layer fixed point to the argmax operator of the Iq-function. Finally, we provide the explicit characterization of an optimal policy as a Gaussian distribution in the general linear-quadratic (LQ) setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops theoretical foundations for continuous-time q-learning in entropy-regularized mean-field control with controlled common noise. It establishes convergence of the exploratory value function (under discretely sampled actions) to the relaxed-control value as the time grid refines, derives the exploratory HJB equation (with an additional nonlinear functional of policy induced by controlled common noise), proves existence and uniqueness of the optimal one-step policy iteration under a concavity condition via a first-order condition using the partial linear functional derivative with respect to policy, introduces the integrated q-function (Iq-function) on state distributions and policies to obtain a two-layer fixed-point characterization of optimal policies, and derives an explicit Gaussian characterization of the optimal policy in the general linear-quadratic setting.

Significance. If the derivations hold, the work supplies a rigorous basis for policy-iteration algorithms in MFC with common noise, extending the single-agent q-function of Jia and Zhou (2023) while handling mean-field interactions. The explicit LQ Gaussian form provides a concrete, falsifiable prediction that can be checked numerically. The introduction of the Iq-function is a natural but non-trivial extension that organizes the fixed-point argument.

major comments (2)
  1. [Abstract / policy-iteration derivation] Abstract and the section deriving the one-step policy iteration: the 'certain concavity condition' required for existence and uniqueness of the optimal policy via the first-order condition (partial linear functional derivative w.r.t. policy) is never stated explicitly. It is unclear whether concavity is imposed on the exploratory value functional, on the Iq-function, or on the entropy-regularized objective, and in which topology on the space of relaxed controls. Because the subsequent two-layer fixed-point characterization and the verification of policy improvement both rest on this step, the claim cannot be assessed without the precise functional and topology.
  2. [Exploratory HJB derivation] Section on the exploratory HJB and the nonlinear functional of policy: the additional nonlinear term arising from controlled common noise is asserted to render policy iteration 'intricate,' yet no explicit verification is given that the concavity condition (whatever it is) is preserved under the mean-field interaction. If concavity fails to hold globally, the uniqueness of the argmax and the identification of the optimal policy as a two-layer fixed point of the Iq-function both collapse.
minor comments (2)
  1. [Iq-function introduction] The definition of the Iq-function (state distribution and policy arguments) should be written with explicit functional dependence on the mean-field measure to avoid ambiguity when the common-noise term is present.
  2. [Notation] Notation for relaxed controls versus ordinary controls should be unified across the convergence result and the HJB derivation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments, which help clarify the presentation of our results on continuous-time q-learning for entropy-regularized mean-field control with common noise. We address each major comment below and will revise the manuscript to improve explicitness and rigor.

  1. Referee: [Abstract / policy-iteration derivation] Abstract and the section deriving the one-step policy iteration: the 'certain concavity condition' required for existence and uniqueness of the optimal policy via the first-order condition (partial linear functional derivative w.r.t. policy) is never stated explicitly. It is unclear whether concavity is imposed on the exploratory value functional, on the Iq-function, or on the entropy-regularized objective, and in which topology on the space of relaxed controls. Because the subsequent two-layer fixed-point characterization and the verification of policy improvement both rest on this step, the claim cannot be assessed without the precise functional and topology.

    Authors: We agree that the concavity condition requires explicit statement. In the revised version, we will define it precisely as the strict concavity of the entropy-regularized objective (equivalently, the exploratory value functional) with respect to the relaxed control in the topology of weak convergence of measures on the space of admissible policies. This concavity is with respect to the policy variable for fixed state distribution, and it is used to guarantee a unique maximizer via the first-order condition involving the partial linear functional derivative. We will also note that the Iq-function inherits this property through its definition on state distributions and policies, ensuring the two-layer fixed-point characterization holds. revision: yes

  2. Referee: [Exploratory HJB derivation] Section on the exploratory HJB and the nonlinear functional of policy: the additional nonlinear term arising from controlled common noise is asserted to render policy iteration 'intricate,' yet no explicit verification is given that the concavity condition (whatever it is) is preserved under the mean-field interaction. If concavity fails to hold globally, the uniqueness of the argmax and the identification of the optimal policy as a two-layer fixed point of the Iq-function both collapse.

    Authors: The referee correctly identifies that preservation of concavity under the mean-field interaction with controlled common noise is not explicitly verified in the current draft. The nonlinear functional arises from the common noise term in the exploratory HJB, but the entropy regularization and the linear-quadratic structure in the general LQ case ensure concavity is retained (as confirmed by the explicit Gaussian characterization). For the general case, we will add a dedicated remark or short appendix showing that the concavity condition is preserved because the mean-field interaction enters linearly in the dynamics and the objective remains jointly concave in the policy for fixed distributions. If this verification requires additional assumptions, we will state them explicitly rather than claiming global validity. revision: partial
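
The uniqueness step invoked in the responses above rests on a standard convexity argument. A sketch, assuming an objective $F$ that is strictly concave on a convex set $\Pi$ of admissible policies (generic symbols, not the paper's notation):

```latex
\[
\pi_1 \neq \pi_2,\quad F(\pi_1) = F(\pi_2) = \sup_{\Pi} F
\;\Longrightarrow\;
F\big(\tfrac12 \pi_1 + \tfrac12 \pi_2\big) > \tfrac12 F(\pi_1) + \tfrac12 F(\pi_2) = \sup_{\Pi} F ,
\]
% a contradiction, since \tfrac12\pi_1 + \tfrac12\pi_2 \in \Pi by convexity;
% hence F has at most one maximizer on \Pi.
```

This gives uniqueness only. Existence of a maximizer is a separate question and typically needs a compactness or coercivity argument that the sketch does not supply.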

Circularity Check

0 steps flagged

Minor self-citation to prior single-agent q-function; central MFC derivations remain independent

The paper cites Jia and Zhou (2023) only to name the base q-function concept from the single-agent case and then extends it to mean-field control with common noise via standard HJB and relaxed-control machinery. The existence/uniqueness result under the concavity assumption, the two-layer fixed-point identification of the optimal policy via the newly introduced Iq-function, and the Gaussian characterization in the LQ case are all developed from first principles within this manuscript without reducing to the citation by construction or to any fitted input. No self-definitional loops, ansatz smuggling, or renaming of known results occur.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on a concavity assumption whose precise scope is not detailed in the abstract, plus standard background results from stochastic control; the integrated q-function is a newly introduced object whose properties are derived within the paper.

axioms (1)
  • Domain assumption: a certain concavity condition
    Invoked to establish existence and uniqueness of the optimal one-step policy iteration via first-order condition.
invented entities (1)
  • Integrated q-function (Iq-function), no independent evidence
    purpose: Defined on the state distribution and the policy to serve as the object whose argmax fixed point identifies an optimal policy.
    Newly introduced in the mean-field setting; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5543 in / 1520 out tokens · 94108 ms · 2026-05-07T09:11:32.756135+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. C. Bender and N. T. Thuan (2024): On the grid-sampling limit SDE. Preprint, available at arXiv:2410.07778.
  2. L. Bo, Y. Huang and X. Yu (2025): On optimal tracking portfolio in incomplete markets: The reinforcement learning approach. SIAM Journal on Control and Optimization. 63(1), 321-348.
  3. R. Buckdahn, J. Li, S. Peng and C. Rainer: Mean-field stochastic differential equations and associated PDEs. Annals of Probability. 45(2), 824-878.
  4. R. Buckdahn, Y. Chen and J. Li (2021): Partial derivative with respect to the measure and its application to general controlled mean-field systems. Stochastic Processes and their Applications. 134, 265-307.
  5. R. Carmona, F. Delarue and A. Lachapelle (2013): Control of McKean-Vlasov dynamics versus mean field games. Mathematics and Financial Economics. 7, 131-166.
  6. R. Carmona and F. Delarue (2018a): Probabilistic Theory of Mean Field Games with Applications, Vol I. Springer.
  7. R. Carmona and F. Delarue (2018b): Probabilistic Theory of Mean Field Games with Applications, Vol II. Springer.
  8. R. Carmona and M. Laurière (2025): Reconciling Discrete-Time Mixed Policies and Continuous-Time Relaxed Controls in Reinforcement Learning and Stochastic Control. Preprint, available at arXiv:2504.21793.
  9. R. Carmona, M. Laurière and Z. Tan (2023): Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Annals of Applied Probability. 33(6B), 5334-5381.
  10. J. F. Chassagneux, D. Crisan and F. Delarue (2022): A probabilistic approach to classical solutions of the master equation for large population equilibria. Memoirs of the AMS, volume 280.
  11. H. Cheung, J. Qiu and A. Badescu (2023): A viscosity solution theory of stochastic Hamilton-Jacobi-Bellman equations in the Wasserstein space. Preprint, available at arXiv:2310.14446.
  12. G. Conforti, A. Kazeykina and Z. Ren (2023): Game on random environment, mean-field Langevin system, and neural networks. Mathematics of Operations Research. 48(1), 78-99.
  13. D. Crisan and E. McMurray (2018): Smoothing properties of McKean–Vlasov SDEs. Probability Theory and Related Fields. 171, 97-148.
  14. M. Dai, Y. Dong and Y. Jia (2023): Learning equilibrium mean-variance strategy. Mathematical Finance. 33(4), 1166-1212.
  15. M. Dai, Y. Dong, Y. Jia and X. Y. Zhou (2023): Data-driven Merton's strategies via policy randomization. Preprint, available at arXiv:2312.11797.
  16. M. F. Djete, D. Possamaï and X. Tan (2022): McKean–Vlasov optimal control: the dynamic programming principle. The Annals of Probability. 50(2), 791-833.
  17. Y. Dong (2024): Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. SIAM Journal on Control and Optimization. 62(3), 1590-1614.
  18. P. Dupuis and R. S. Ellis (2011): A weak convergence approach to the theory of large deviations. John Wiley & Sons.
  19. O. Kallenberg (2002): Foundations of Modern Probability. Probability and its Applications (New York). Springer Verlag, New York, second edition.
  20. D. Lacker (2015): Mean field games via controlled martingale problems: existence of Markovian equilibria. Stochastic Processes and their Applications. 125(7), 2856-2894.
  21. N. Frikha, M. Germain, M. Laurière, H. Pham and X. Song (2023): Actor-Critic learning for mean-field control in continuous time. Journal of Machine Learning Research. 26(127), 1-42.
  22. P. Graber (2016): Linear quadratic mean field type control and mean field games with common noise, with applications to production of an exhaustible resource. Applied Mathematics & Optimization. 74, 459-486.
  23. H. Gu, X. Guo, X. Wei and R. Xu (2021): Mean-field controls with Q-learning for cooperative MARL: Convergence and complexity analysis. SIAM Journal on Mathematics of Data Science. 3(4), 1168-1196.
  24. H. Gu, X. Guo, X. Wei and R. Xu (2023): Dynamic programming principles for mean-field controls with learning. Operations Research. 71(4), 1040-1054.
  25. X. Guo, R. Xu and T. Zariphopoulou (2022): Entropy regularization for mean field games with learning. Mathematics of Operations Research. 47(4), 3239-3260.
  26. X. Han, R. Wang and X. Y. Zhou (2023): Choquet regularization for continuous-time reinforcement learning. SIAM Journal on Control and Optimization. 61(5), 2777-2801.
  27. M. Huang, R. P. Malhamé and P. E. Caines (2006): Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information and Systems. 6(3), 221-252.
  28. Y. Huang, M. Li, X. Yu and Z. Zhou (2025): Continuous-time reinforcement learning for optimal switching over multiple regimes. Preprint, available at arXiv:2512.04697.
  29. Y. Jia and X. Y. Zhou (2022a): Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research. 23, 1-50.
  30. Y. Jia and X. Y. Zhou (2022b): Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research. 23, 1-55.
  31. Y. Jia and X. Y. Zhou (2023): q-learning in continuous time. Journal of Machine Learning Research. 24, 1-61.
  32. Y. Jia, D. Ouyang and Y. Zhang (2025): Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. SIAM Journal on Control and Optimization, forthcoming.
  33. Y. Jia (2026): Continuous-time risk-sensitive reinforcement learning via quadratic variation penalty. Applied Mathematics & Optimization, forthcoming.
  34. J. M. Lasry and P. L. Lions (2007): Mean field games. Japanese Journal of Mathematics. 2(1), 229-260.
  35. H. Liang, Z. Chen and K. Jing (2024): Actor-critic reinforcement learning algorithms for mean field games in continuous time, state and action spaces. Applied Mathematics and Optimization. 89(3), 72.
  36. P. L. Lions (2006): Cours au Collège de France: Théorie des jeux à champ moyens. Audio Conference.
  37. R. J. McCann (1997): A convexity principle for interacting gases. Advances in Mathematics. 128(1), 153-179.
  38. M. Motte and H. Pham (2022): Mean-field Markov decision processes with common noise and open-loop controls. Annals of Applied Probability. 32(2), 1421-1458.
  39. H. Pham and X. Wei (2017): Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM Journal on Control and Optimization. 55(2), 1069-1101.
  40. Z. Ren, X. Wei, X. Yu and X. Y. Zhou (2026): Continuous-time q-learning for mean-field control with common noise, part-II: q-learning algorithms. Working paper.
  41. D. Stroock and S. Varadhan (1997): Multidimensional diffusion processes, volume 233 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag Berlin Heidelberg.
  42. L. Szpruch, T. Treetanthiploet and Y. Zhang (2024): Optimal scheduling of entropy regularizer for continuous-time linear-quadratic reinforcement learning. SIAM Journal on Control and Optimization. 62(1), 135-166.
  43. C. Villani (2009): Optimal transport: old and new. Berlin: Springer.
  44. H. Wang, T. Zariphopoulou and X. Y. Zhou (2020): Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research. 21(1), 8145-8178.
  45. B. Wang, X. Gao and L. Li (2023): Reinforcement learning for continuous-time optimal execution: Actor-Critic algorithm and error analysis. Finance and Stochastics. 30, 597-655.
  46. C. Watkins and P. Dayan (1992): Q-learning. Machine Learning. 8(3), 279-292.
  47. X. Wei and X. Yu (2025): Continuous-time q-learning for mean-field control problems. Applied Mathematics and Optimization. 91, 10.
  48. X. Wei, X. Yu and F. Yuan (2024): Unified continuous-time q-learning for mean-field game and mean-field control problems. Preprint, available at arXiv:2407.04521.
  49. W. Wonham (1968): On a matrix Riccati equation of stochastic control. SIAM Journal on Control and Optimization. 6(4), 681-697.
  50. J. Yong (2013): Linear-quadratic optimal control problems for mean-field stochastic differential equations. SIAM Journal on Control and Optimization. 51(4), 2809-38.
  51. J. Zhou, N. Touzi and J. Zhang (2024): Viscosity solutions for HJB equations on the process space: Application to mean field control with common noise. Preprint, available at arXiv:2401.04920.