
arxiv: 2604.27378 · v1 · submitted 2026-04-30 · 🧮 math.OC · cs.LG · cs.MA

Continuous-time q-learning for mean-field control with common noise, part-II: q-learning algorithms

Pith reviewed 2026-05-07 08:33 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.MA
keywords mean-field control · q-learning · common noise · actor-critic · linear quadratic · martingale condition · relaxed control · exploratory formulation

The pith

Q-learning algorithms learn optimal policies for continuous-time mean-field control with common noise by substituting observable exploratory data for non-observable relaxed distributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops implementable q-learning methods for mean-field control problems that include controlled common noise. It begins with martingale conditions on value and Iq-functions evaluated along conditional state distributions generated by test policies in the relaxed control setting. Because those distributions cannot be observed directly, the authors quantify the error that arises when they are replaced by data from an exploratory formulation that uses discretely sampled actions. This error bound, together with a two-layer fixed-point characterization of optimal policies, supports concrete algorithms such as an Actor-Critic scheme. In this scheme the actor updates the policy via iteration on an improved Iq-function while the critic updates the value and Iq-functions from martingale orthogonality conditions applied to exploratory trajectories. The work also proves convergence of the inner actor iterations in the infinite-horizon linear-quadratic case and reports satisfactory numerical performance inside and outside that setting.
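To make "observable data from an exploratory formulation under discretely sampled actions" concrete, here is a minimal sketch of how such trajectories might be generated. It is an illustration only, not the paper's model: the drift, the Gaussian policy, the constant (uncontrolled) common-noise coefficient `sigma0`, and every name below are assumptions chosen for simplicity. Each agent resamples its action from the policy every `h` time units and holds it in between, while one shared Brownian increment drives all agents.

```python
import numpy as np

def simulate_exploratory(n_agents=500, n_steps=100, T=1.0, h=0.1,
                         sigma=0.3, sigma0=0.2, policy_std=0.5, seed=0):
    """Euler scheme for a toy mean-field system with actions resampled every h.

    All dynamics and the policy are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    hold = max(1, int(round(h / dt)))   # Euler steps during which an action is held
    x = np.zeros(n_agents)              # all agents start at 0
    a = np.zeros(n_agents)
    mean_path = np.empty(n_steps + 1)   # empirical proxy for the conditional mean
    mean_path[0] = x.mean()
    for k in range(n_steps):
        if k % hold == 0:
            # Sample fresh actions from a hypothetical Gaussian policy that
            # depends on the own state and on the empirical mean.
            m = x.mean()
            a = -(x - m) - 0.5 * m + policy_std * rng.standard_normal(n_agents)
        dw = np.sqrt(dt) * rng.standard_normal(n_agents)   # idiosyncratic noise
        dw0 = np.sqrt(dt) * rng.standard_normal()          # common noise, shared by all
        x = x + a * dt + sigma * dw + sigma0 * dw0
        mean_path[k + 1] = x.mean()
    return x, mean_path
```

With `sigma=0.0` and `policy_std=0.0` every agent follows the same path, which is a quick sanity check that the common noise really is shared across the population.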

Core claim

Based on the relaxed control formulation, the martingale condition of the value function and the Iq-function is established by evaluating along the conditional state distributions generated by all test policies. The error incurred when these non-observable distributions are replaced by observable data from the exploratory formulation under discretely sampled actions is quantified. Combined with the two-layer fixed-point characterization of an optimal policy, this error control permits several algorithms, including an Actor-Critic q-learning procedure in which the policy is updated in the Actor step by the iteration rule induced by the improved Iq-function, and the value function together with the Iq-function is updated in the Critic step from the martingale orthogonality condition using data from the exploratory formulation.

What carries the argument

The Actor-Critic q-learning algorithm that updates the policy in the Actor-step from the iteration rule of the improved Iq-function and updates the value function and Iq-function in the Critic-step from the martingale orthogonality condition applied to exploratory data.
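The Critic-step idea can be illustrated on a much simpler problem than the paper's. The sketch below is a single-agent policy-evaluation analogue (in the spirit of the martingale approach of Jia and Zhou), not the mean-field algorithm itself: the state is a Brownian motion started at 0, the running reward is x², the horizon is T, and the value function is known in closed form, J(t, x) = x²(T − t) + (T − t)²/2. The parametrization, the choice of test functions, and all names are assumptions. The critic picks θ so that the discretized increments of J^θ(t, X_t) + ∫ r ds are orthogonal to the test functions.

```python
import numpy as np

def features(tau, x):
    """Basis for J(t, x) = theta_1 * x^2 * tau + theta_2 * tau^2, with tau = T - t."""
    return np.stack([x**2 * tau, tau**2 * np.ones_like(x)], axis=-1)

def critic_by_orthogonality(n_paths=20000, n_steps=50, T=1.0, seed=0):
    """Solve the sample martingale orthogonality equations for theta.

    Toy setting (hypothetical): dX = dW, X_0 = 0, running reward r(x) = x^2.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for k in range(n_steps):
        tau = T - k * dt
        x_next = x + np.sqrt(dt) * rng.standard_normal(n_paths)
        phi = features(tau, x)
        phi_next = features(tau - dt, x_next)
        xi = phi                                  # test functions: gradient of J in theta
        # Orthogonality: mean of xi * (J_next - J + r * dt) = 0, linear in theta.
        A += xi.T @ (phi_next - phi) / n_paths
        b -= xi.T @ (x**2 * dt) / n_paths
        x = x_next
    return np.linalg.solve(A, b)
```

Under these assumptions the returned vector should be close to the exact coefficients (1, 0.5), up to Monte Carlo noise and a time-discretization bias. Because the parametrization is linear, the orthogonality condition reduces to a 2×2 linear system; with nonlinear approximators one would use stochastic approximation instead.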

If this is right

  • The algorithms achieve satisfactory performance when implemented in numerical examples both within and outside the linear-quadratic framework.
  • Inner iterations of the Actor step converge in the infinite-horizon linear-quadratic setting.
  • The quantified error bound between relaxed and exploratory data justifies the use of practical observable trajectories in the learning updates.
  • The two-layer fixed-point structure of an optimal policy permits separate Actor and Critic updates without simultaneous solution of the full optimality system.
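For intuition on what convergence of inner policy iterations looks like in a linear-quadratic setting, the following sketch runs the classical policy iteration for a deterministic scalar LQ regulator. This is a textbook stand-in, not the paper's Actor-step rule: there is no mean-field term, no common noise, and no entropy regularization, and the function names and default coefficients are assumptions. The system is dx = (a·x + b·u) dt with cost ∫ (q·x² + r·u²) dt and linear feedback u = −k·x.

```python
import math

def kleinman_iteration(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0, n_iter=20):
    """Alternate policy evaluation and improvement for a scalar LQ regulator."""
    if b * k0 <= a:
        raise ValueError("initial gain must stabilise the closed loop (b*k0 > a)")
    k = k0
    history = [k]
    for _ in range(n_iter):
        # Policy evaluation: cost of u = -k x is p * x^2 with
        # 2 * (a - b*k) * p + q + r * k^2 = 0.
        p = (q + r * k**2) / (2.0 * (b * k - a))
        # Policy improvement: minimise the Hamiltonian, giving k = b * p / r.
        k = b * p / r
        history.append(k)
    return k, history

def riccati_gain(a=1.0, b=1.0, q=1.0, r=1.0):
    """Optimal gain from the positive root of the scalar Riccati equation."""
    p = r * (a + math.sqrt(a**2 + b**2 * q / r)) / b**2
    return b * p / r
```

With the default coefficients the gains decrease from 3 towards 1 + √2, the Riccati gain. The iteration requires a stabilizing initial gain, which is why the function rejects `b*k0 <= a`.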

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the error bound extends to continuous action spaces, the same substitution technique could support q-learning in mean-field control problems without action discretization.
  • The approach may transfer to learning in mean-field games where agents face common shocks and only aggregate statistics are observable.
  • Applying the algorithms to finite-population approximations of mean-field systems would test whether the convergence properties survive the passage to the infinite-agent limit.

Load-bearing premise

The error incurred when non-observable conditional state distributions from the relaxed control formulation are replaced by observable data from the exploratory formulation under discretely sampled actions can be quantified and controlled.

What would settle it

Running the Actor-Critic algorithm on the infinite-horizon linear-quadratic example and observing that the inner policy iterations diverge from the known optimum, or that the quantified replacement error exceeds the derived bound in the reported test cases, would falsify the claims.

Figures

Figures reproduced from arXiv: 2604.27378 by Xiang Yu, Xiaoli Wei, Xun Yu Zhou, Zhenjie Ren.

Figure 2: Convergence of the learnt parameters under Algorithm
Figure 3: Convergence of the learnt parameters under Algorithm
Figure 4: Convergence of the learnt parameters under Algorithm
Figure 5: Convergence of the learnt parameters under Algorithm
Figure 6: Convergence of the learnt parameters under Algorithm

Original abstract

This paper is a continuation work of Ren et al. (2026) aiming to further devise q-learning algorithms for mean-field control (MFC) with controlled common noise. Based on the relaxed control formulation, we first establish the martingale condition of the value function and the Iq-function by evaluating along the conditional state distributions generated by all test policies. As the data in the relaxed control formulation are not observable in practice, we quantify the error incurred when they are replaced by the observable ones in the exploratory formulation under discretely sampled actions. This, together with a two-layer fixed point characterization of an optimal policy in Ren et al. (2026), allows us to propose several algorithms including the Actor-Critic q-learning algorithm, in which the policy is updated in the Actor-step based on the iteration rule induced by the improved Iq-function, and the value function and Iq-function are updated in the Critic-step based on the martingale orthogonality condition using the data from the exploratory formulation. We also establish the convergence of the inner iterations in the Actor-step in an infinite-horizon linear quadratic (LQ) framework. In two examples, within and beyond LQ framework, our q-learning algorithms are implemented with satisfactory performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This continuation paper develops q-learning algorithms for mean-field control with controlled common noise. Starting from the relaxed-control formulation, it derives martingale orthogonality conditions for the value function and Iq-function evaluated along conditional state distributions induced by test policies. It then quantifies the approximation error incurred by replacing those non-observable distributions with observable data generated by an exploratory formulation under discretely sampled actions. Combining this error control with the two-layer fixed-point characterization of an optimal policy from the authors' prior Part-I work, the paper proposes several algorithms, including an Actor-Critic scheme in which the actor updates the policy via the improved Iq-function and the critic updates the value and Iq-functions via the (approximate) martingale conditions. Convergence of the inner actor iterations is proved in the infinite-horizon linear-quadratic case, and the algorithms are illustrated numerically on both LQ and non-LQ examples.

Significance. If the error bounds are rigorous and sufficiently tight, the work supplies the first implementable q-learning procedures for MFC with common noise that are grounded in a relaxed-control martingale characterization. The explicit LQ convergence result for the actor inner loop and the numerical validation constitute concrete strengths. The contribution is incremental on the Part-I fixed-point theory but addresses a practically relevant gap between theoretical relaxed formulations and observable data.

major comments (3)
  1. [§3.2] §3.2 (error quantification between relaxed and exploratory formulations): The bound on the discrepancy between the non-observable conditional distributions appearing in the martingale orthogonality conditions and the observable data generated under discrete action sampling must be shown to remain small enough that the orthogonality relation is preserved up to a controllable perturbation. The current derivation appears to yield an O(h) term (h = sampling interval), but its dependence on the mean-field interaction Lipschitz constant and the common-noise intensity is not made explicit; without this, the transfer of the two-layer fixed-point iteration from Part I to the implementable Actor-Critic scheme is not guaranteed.
  2. [§4.3] §4.3 (Actor-Critic algorithm and inner-loop convergence): The policy-update rule in the actor step is induced by the improved Iq-function, yet the convergence proof for the inner iterations (Theorem 5.1) is stated only for the exact LQ case. It is unclear whether the proof accounts for the residual error that propagates from the critic's use of exploratory data; an explicit perturbation analysis showing that the contraction mapping remains valid when the orthogonality condition holds only approximately is required.
  3. [§5] §5 (numerical examples): The reported performance in the non-LQ example relies on the same error-controlled substitution, but no diagnostic is provided that quantifies how large the realized approximation error actually is (e.g., distance between the empirical conditional distributions and the relaxed ones). This leaves open whether the satisfactory numerical results truly validate the error-control claim or merely reflect a favorable choice of discretization.
minor comments (2)
  1. [§2] The notation for the Iq-function and the two-layer fixed-point map is introduced without a self-contained recap of the Part-I definitions; a short paragraph or table summarizing the key objects would improve readability for readers who have not yet consulted the companion paper.
  2. [§3.1] In the statement of the martingale orthogonality condition, the test-policy class is not explicitly delimited; clarifying whether the class includes only feedback policies or also open-loop controls would help readers verify that the derived conditions are sufficient for optimality.
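The first-order rate discussed in major comment 1 can be probed numerically once errors have been measured at several sampling intervals. The helper below is a generic diagnostic sketch, not part of the paper: it fits a line to log(error) against log(h), so a slope near 1 is consistent with an O(h) bound and the fitted prefactor gives a rough estimate of the constant. The function name is hypothetical.

```python
import numpy as np

def empirical_order(step_sizes, errors):
    """Least-squares fit of errors ~ C * h**p on a log-log scale; returns (p, C)."""
    hs = np.asarray(step_sizes, dtype=float)
    es = np.asarray(errors, dtype=float)
    if hs.shape != es.shape or hs.size < 2:
        raise ValueError("need at least two (h, error) pairs of matching length")
    slope, intercept = np.polyfit(np.log(hs), np.log(es), 1)
    return float(slope), float(np.exp(intercept))

if __name__ == "__main__":
    # Synthetic placeholder data that decays exactly like 3 * h; not measured results.
    hs = np.array([0.1, 0.05, 0.025, 0.0125])
    order, constant = empirical_order(hs, 3.0 * hs)
    print(order, constant)
```

Repeating the fit for several values of the interaction strength or the common-noise intensity would show how the fitted constant varies with them, which is the dependence the referee asks to be made explicit.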

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the thorough review and constructive comments on our manuscript. We address each major point below and describe the revisions that will be incorporated to strengthen the paper.

Point-by-point responses
  1. Referee: [§3.2] The bound on the discrepancy between the non-observable conditional distributions appearing in the martingale orthogonality conditions and the observable data generated under discrete action sampling must be shown to remain small enough that the orthogonality relation is preserved up to a controllable perturbation. The current derivation appears to yield an O(h) term (h = sampling interval), but its dependence on the mean-field interaction Lipschitz constant and the common-noise intensity is not made explicit; without this, the transfer of the two-layer fixed-point iteration from Part I to the implementable Actor-Critic scheme is not guaranteed.

    Authors: We thank the referee for highlighting this issue. Proposition 3.2 currently establishes an O(h) error bound under standard Lipschitz assumptions on the dynamics and costs. We agree that the explicit dependence on the mean-field interaction Lipschitz constant L_mf and common-noise intensity σ should be displayed. In the revision we will expand the proof of Proposition 3.2 to obtain the sharper bound C(L_mf, σ) h, where the prefactor C grows at most linearly in L_mf and σ. With this explicit form, the perturbation to the martingale orthogonality condition remains controllable for sufficiently small h, thereby justifying the transfer of the two-layer fixed-point characterization from Part I (with an additional vanishing error term). The updated proposition and proof will appear in the revised Section 3.2. revision: yes

  2. Referee: [§4.3] The policy-update rule in the actor step is induced by the improved Iq-function, yet the convergence proof for the inner iterations (Theorem 5.1) is stated only for the exact LQ case. It is unclear whether the proof accounts for the residual error that propagates from the critic's use of exploratory data; an explicit perturbation analysis showing that the contraction mapping remains valid when the orthogonality condition holds only approximately is required.

    Authors: We appreciate this observation. Theorem 5.1 proves contraction for the exact martingale orthogonality condition in the infinite-horizon LQ setting. In the Actor-Critic algorithm the critic employs approximate data, introducing a residual error of size O(h). We will add a new perturbation lemma in Section 4.3 (immediately preceding Theorem 5.1) that shows: if the orthogonality condition holds up to an additive error ε, then the inner actor iterations converge to a policy whose value function lies within O(ε) of the optimum, while the contraction rate remains strictly less than one for small enough ε. The lemma will be proved by a standard perturbation argument on the Bellman operator and will be used to justify the practical algorithm. This material will be included in the revised manuscript. revision: yes

  3. Referee: [§5] The reported performance in the non-LQ example relies on the same error-controlled substitution, but no diagnostic is provided that quantifies how large the realized approximation error actually is (e.g., distance between the empirical conditional distributions and the relaxed ones). This leaves open whether the satisfactory numerical results truly validate the error-control claim or merely reflect a favorable choice of discretization.

    Authors: We agree that a quantitative diagnostic would strengthen the numerical validation. In the revised Section 5 we will add, for the non-LQ example, a table (or subplot) reporting the 2-Wasserstein distance between the empirical conditional distributions obtained from the exploratory data and the corresponding relaxed-control distributions at representative time instants. The table will also list the chosen sampling interval h and the resulting error magnitude (expected to be on the order of 10^{-2} or smaller). This diagnostic will confirm that the approximation error remains small for the discretization used, thereby supporting the error-control claim beyond the LQ case. revision: yes
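The diagnostic proposed in the third response can be computed directly when the state is one-dimensional and both distributions are represented by samples of equal size. In that case the 2-Wasserstein distance between the two empirical measures is obtained by matching sorted samples. The sketch below assumes exactly that setting; the function name is hypothetical, and multi-dimensional states would need an optimal-transport solver instead.

```python
import numpy as np

def wasserstein2_1d(samples_a, samples_b):
    """2-Wasserstein distance between two equal-size empirical measures on the line."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expects two 1-D samples of equal size")
    # In one dimension the optimal coupling pairs order statistics.
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

For example, shifting a sample by a constant c gives a distance of exactly |c|, so `wasserstein2_1d([0, 1, 2], [3, 4, 5])` returns 3.0.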

Circularity Check

1 step flagged

Self-cited two-layer fixed-point characterization load-bearing for algorithm proposal and convergence transfer

specific steps
  1. self citation load bearing [Abstract]
    "This, together with a two-layer fixed point characterization of an optimal policy in Ren et al. (2026), allows us to propose several algorithms including the Actor-Critic q-learning algorithm, in which the policy is updated in the Actor-step based on the iteration rule induced by the improved Iq-function, and the value function and Iq-function are updated in the Critic-step based on the martingale orthogonality condition using the data from the exploratory formulation. We also establish the convergence of the inner iterations in the Actor-step in an infinite-horizon linear quadratic (LQ) framework."

    The algorithms and their convergence rest on combining the paper's new error quantification with the two-layer fixed-point characterization of an optimal policy, which is imported wholesale from the authors' own prior work (part I). While the martingale conditions and error analysis are derived here, the fixed-point structure that justifies the Actor-step iteration rule and allows the convergence argument to transfer is not re-derived or independently verified in this manuscript, making the self-citation load-bearing for the central claims.

full rationale

The paper develops original martingale orthogonality conditions from the relaxed-control formulation and quantifies the replacement error when using observable exploratory data under discrete sampling. These steps are self-contained. However, the proposal of the Actor-Critic q-learning algorithm and the transfer of the fixed-point iteration to the implementable scheme explicitly combine the new error analysis with the two-layer fixed-point characterization of an optimal policy imported from the authors' prior paper (Ren et al. 2026). This creates a moderate load-bearing self-citation dependency for the central algorithmic framework and the claimed convergence of inner Actor iterations (even in the LQ case), though the error control and convergence arguments themselves contain independent content developed here. No reduction by construction or fitted-input renaming occurs within this paper's own derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the relaxed control formulation, the martingale condition for value and Iq-functions, the error quantification between relaxed and exploratory data, and the two-layer fixed-point characterization from the authors' prior work. No explicit free parameters or new invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Martingale condition of the value function and Iq-function obtained by evaluating along conditional state distributions generated by all test policies under the relaxed control formulation.
    Invoked to derive the orthogonality condition used in the critic step.
  • domain assumption Two-layer fixed point characterization of an optimal policy.
    Referenced from Ren et al. (2026) and used to justify the actor update rule.



Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. L. Ambrosio, N. Gigli and G. Savaré (2005): Gradient flows: In metric spaces and in the space of probability measures. Springer Science & Business Media
  2. B. Anahtarci, C. D. Kariksiz and N. Saldi (2022): Q-learning in regularized mean-field games. Preprint, available at arXiv:2003.12151
  3. A. Angiuli, J. P. Fouque and M. Laurière (2022): Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems. 34(2), 217-271
  4. A. Angiuli, J. P. Fouque, R. Hu and A. Raydan (2023a): Deep reinforcement learning for infinite horizon mean field problems in continuous spaces. Preprint, available at arXiv:2309.10953. To appear in Journal of Machine Learning
  5. A. Angiuli, J.P. Fouque, M. Laurière and M. Zhang (2023b): Convergence of multi-scale reinforcement Q-learning algorithms for mean field game and control problems. Preprint, available at arXiv:2312.06659
  6. L. Bo, Y. Huang and X. Yu (2025): On optimal tracking portfolio in incomplete markets: The reinforcement learning approach. SIAM Journal on Control and Optimization. 63(1), 321-348
  7. R. Carmona and F. Delarue (2018a): Probabilistic Theory of Mean Field Games with Applications, Vol I. Springer
  8. R. Carmona and F. Delarue (2018b): Probabilistic Theory of Mean Field Games with Applications, Vol II. Springer
  9. R. Carmona, J. P. Fouque and L. H. Sun (2015): Mean field games and systemic risk. Communications in Mathematical Sciences, 13(4):911-933
  10. R. Carmona, F. Delarue and D. Lacker (2016): Mean field games with common noise. Annals of Probability, 44(6), 3740-3803
  11. R. Carmona and M. Laurière (2025): Reconciling Discrete-Time Mixed Policies and Continuous-Time Relaxed Controls in Reinforcement Learning and Stochastic Control. Preprint, available at arXiv:2504.21793
  12. R. Carmona, M. Laurière and Z. Tan (2023): Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Annals of Applied Probability. 33(6B), 5334-5381
  13. J.F. Chassagneux, D. Crisan and F. Delarue (2022): A probabilistic approach to classical solutions
  14. H. Cheung, J. Qiu and A. Badescu (2023): A viscosity solution theory of stochastic Hamilton-Jacobi-Bellman equations in the Wasserstein space. Preprint, available at arXiv:2310.14446
  15. G. Conforti, A. Kazeykina and Z. Ren (2023): Game on random environment, mean-field Langevin system, and neural networks. Mathematics of Operations Research, 48(1):78-99
  16. A. Cosso, F. Gozzi, I. Kharroubi, H. Pham and M. Rosestolato (2020): Optimal control of path-dependent McKean-Vlasov SDEs in infinite dimension. Preprint, available at arXiv:2012.14772
  17. D. Crisan and E. McMurray (2018): Smoothing properties of McKean–Vlasov SDEs. Probability Theory and Related Fields, 171:97–148
  18. K. Cui, A. Tahir, M. Sinzger and H. Koeppl (2021): Discrete-time mean field control with environment states. In 2021 60th IEEE Conference on Decision and Control (CDC)
  19. M. Dai, Y. Dong and Y. Jia (2023): Learning equilibrium mean-variance strategy. Mathematical Finance. 33(4), 1166-1212
  20. M. Dai, Y. Dong, Y. Jia and X. Y. Zhou (2023): Data-driven Merton's strategies via policy randomization. Preprint, available at arXiv:2312.11797
  21. M.F. Djete, D. Possamaï and X. Tan (2022): McKean–Vlasov optimal control: the dynamic programming principle. The Annals of Probability, 50(2):791-833
  22. Y. Dong (2024): Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. SIAM Journal on Control and Optimization. 62(3), 1590-1614
  23. M.F. Djete, D. Possamaï and X. Tan (2022): McKean–Vlasov optimal control: limit theory and equivalence between different formulations. Mathematics of Operations Research, 47(4): 2891-2930
  24. P. Dupuis and R. S. Ellis (2011): A weak convergence approach to the theory of large deviations. John Wiley & Sons
  25. Y. Duan, X. Chen, R. Houthooft, J. Schulman and P. Abbeel (2016): Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 1329-1338. PMLR
  26. K. Doya (2020): Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245
  27. D. Firoozi and S. Jaimungal (2022): Exploratory LQG mean field games with entropy regularization. Automatica 139:110177
  28. N. Frikha, M. Germain, M. Laurière, H. Pham and X. Song (2023): Actor-Critic learning for mean-field control in continuous time. Journal of Machine Learning Research. 26(127):1-42
  29. N. Frikha, H. Pham and X. Song (2024): Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise. Preprint, available at arXiv:2408.02489
  30. M. Giegrich, C. Reisinger and Y. Zhang (2024): Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems. SIAM Journal on Control and Optimization. 62(2):1060-92
  31. H. Gu, X. Guo, X. Wei and R. Xu (2021): Mean-field controls with Q-learning for cooperative MARL: Convergence and complexity analysis. SIAM Journal on Mathematics of Data Science. 3(4), 1168-1196
  32. H. Gu, X. Guo, X. Wei and R. Xu (2022): Mean-field multi-agent reinforcement learning: A decentralized network approach. Mathematics of Operations Research. 50(1), 506-536
  33. X. Guo, R. Xu and T. Zariphopoulou (2022): Entropy regularization for mean field games with learning. Mathematics of Operations Research. 47(4), 3239-3260
  34. X. Han, R. Wang and X. Y. Zhou (2023): Choquet regularization for continuous-time reinforcement learning. SIAM Journal on Control and Optimization. 61(5), 2777-2801
  35. Y. Huang, M. Li, X. Yu and Z. Zhou (2025): Continuous-time reinforcement learning for optimal switching over multiple regimes. Preprint, available at arXiv:2512.04697
  36. Y. Jia and X. Y. Zhou (2022a): Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research. 23, 1-50
  37. Y. Jia and X. Y. Zhou (2022b): Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research. 23, 1-55
  38. Y. Jia and X. Y. Zhou (2023): q-learning in continuous time. Journal of Machine Learning Research. 24, 1-61
  39. Y. Jia (2026): Continuous-time risk-sensitive reinforcement learning via quadratic variation penalty. Applied Mathematics & Optimization, forthcoming
  40. Y. Jia, D. Ouyang and Y. Zhang. Accuracy of discretely sampled stochastic policies in
  41. O. Kallenberg (2002): Foundations of Modern Probability. Probability and its Applications (New York). Springer Verlag, New York, second edition
  42. V. N. Kolokoltsov and M. Troeva (2019): On mean field games with common noise and McKean-Vlasov SPDEs. Stochastic Analysis and Applications, 37(4), 522-549
  43. D. Lacker and T. Zariphopoulou (2018): Mean field and n-agent games for optimal investment under relative performance criteria. Mathematical Finance, 29: 1003-1038
  44. J.M. Lasry and P.L. Lions (2007): Mean field games. Japanese Journal of Mathematics. 2(1), 229-260
  45. M. Laurière, S. Perrin, J. Pérolat, S. Girgin, P. Muller, R. Élie, M. Geist and O. Pietquin (2022): Learning in mean field games: A survey. Preprint, available at arXiv:2205.12944
  46. H. Liang, Z. Chen and K. Jing (2024): Actor-critic reinforcement learning algorithms for mean field games in continuous time, state and action spaces. Applied Mathematics and Optimization. 89(3): 72
  47. P.L. Lions (2006): Cours au Collège de France: Théorie des jeux à champ moyens. Audio Conference
  48. R. J. McCann (1997): A convexity principle for interacting gases. Advances in Mathematics, 128(1): 153-179
  49. M. Motte and H. Pham (2022): Mean-field Markov decision processes with common noise and open-loop controls. Annals of Applied Probability, 32(2):1421-1458
  50. W.U. Mondal, M. Agarwal, V. Aggarwal and S.V. Ukkusuri (2022): On the approximation of cooperative heterogeneous multi-agent reinforcement learning (MARL) using mean field control (MFC). Journal of Machine Learning Research, 23(129), 1-46
  51. W.U. Mondal, V. Aggarwal and S. V. Ukkusuri (2023): Mean-field control based approximation of multi-agent reinforcement learning in presence of a non-decomposable shared global state. Preprint, available at arXiv:2301.06889
  52. B. Pasztor, I. Bogunovic and A. Krause (2021): Efficient model-based multi-agent mean-field reinforcement learning. Preprint, available at arXiv:2107.04050
  53. H. Pham and X. Wei (2017): Dynamic programming for optimal control of stochastic McKean–Vlasov dynamics. SIAM Journal on Control and Optimization, 55(2), 1069-1101
  54. H. Pham and X. Warin (2024): Mean-field neural networks-based algorithms for McKean-Vlasov control problems. Journal of Machine Learning, 3:176-214
  55. H. Pham and X. Warin (2023): Mean-field neural networks: learning mappings on Wasserstein space. Neural Networks, 168:380-93
  56. Z. Ren, X. Wei, X. Yu and X. Y. Zhou (2026): Continuous-time q-learning for mean-field control with common noise, part-I: Theoretical foundations. Working paper
  57. H. Wang, T. Zariphopoulou and X. Y. Zhou (2020): Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research. 21(1):8145-8178
  58. B. Wang, X. Gao and L. Li (2023): Reinforcement learning for continuous-time optimal execution: Actor-Critic algorithm and error analysis. Finance and Stochastics, 30, 597-655
  59. C. J. Watkins (1989): Learning from delayed rewards. Ph.D. thesis, Cambridge University
  60. C. Watkins and P. Dayan (1992): Q-learning. Machine Learning, 8(3):279-292
  61. X. Wei and X. Yu (2025): Continuous-time q-learning for mean-field control problems. Applied Mathematics and Optimization. 91: 10
  62. X. Wei, X. Yu and F. Yuan (2024): Unified continuous-time q-learning for mean-field game and mean-field control problems. Preprint, available at arXiv:2407.04521