arxiv: 2605.02705 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.NI

Federated Reinforcement Learning for Efficient Mobile Crowdsensing under Incomplete Information

Pith reviewed 2026-05-09 15:48 UTC · model grok-4.3

classification 💻 cs.LG cs.NI
keywords federated reinforcement learning · mobile crowdsensing · incomplete information · task participation strategy · energy harvesting · proximal policy optimization · decentralized learning

The pith

A federated deep reinforcement learning method lets each mobile device learn its own sensing-task participation strategy from local data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile crowdsensing platforms publish tasks that phones and other devices can accept for payment, but the best participation choices depend on future task loads, other devices' choices, and each device's own changing energy supply. Because perfect future knowledge is unavailable, devices must learn good policies from their own incomplete, time-varying experiences. The paper introduces FDRL-PPO, a fully decentralized algorithm in which each device trains a local policy with proximal policy optimization and periodically averages only the learned model parameters with others. This exchange lets devices collectively compensate for the fragmented training data caused by intermittent energy harvesting, producing participation rules that raise the fraction of completed tasks while lowering energy use and proposal conflicts.

Core claim

FDRL-PPO enables every mobile unit to learn an effective task-participation policy from its own experiences, resources, and preferences by exchanging only model parameters through federated learning, thereby achieving robust strategies under incomplete information, without the perfect non-causal knowledge of the mobile crowdsensing system that optimal strategies would require.

What carries the argument

FDRL-PPO, a fully decentralized federated proximal-policy-optimization algorithm in which each mobile unit maintains and periodically averages a local actor-critic model without sharing raw experience trajectories.
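
The parameter-averaging step can be illustrated with a minimal sketch. It assumes uniform weights and models stored as flat lists of floats keyed by layer name; the function name `federated_average` and the layer names are hypothetical, and the paper's actual aggregation rule is not given on this page.

```python
def federated_average(local_models):
    """Uniformly average parameter dicts from participating devices.

    local_models: list of dicts mapping a layer name to a flat list of floats.
    Assumption: all dicts share the same keys and list lengths.
    """
    if not local_models:
        raise ValueError("need at least one local model")
    n = len(local_models)
    averaged = {}
    for name in local_models[0]:
        # Group the i-th parameter of every device together, then average.
        columns = zip(*(model[name] for model in local_models))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Hypothetical two-device example with tiny actor and critic layers.
device_a = {"actor.w": [1.0, 2.0], "critic.w": [0.0]}
device_b = {"actor.w": [3.0, 4.0], "critic.w": [1.0]}
global_model = federated_average([device_a, device_b])
```

Only parameters pass through this function; the experience trajectories that produced them never leave the device, which is the property the summary above relies on.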

If this is right

  • Task completion ratio and fairness both increase relative to non-federated and centralized benchmarks.
  • Energy consumption per completed task decreases because devices avoid proposing when their remaining battery is low.
  • The number of conflicting proposals falls as devices learn to coordinate implicitly through shared model parameters.
  • The approach scales to large numbers of devices because no central coordinator collects raw data or solves a global optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same federated-sharing pattern could be applied to other decentralized resource-allocation problems such as edge-computing task offloading where local energy or compute budgets fluctuate.
  • If communication of model updates is itself costly, a sparse or event-triggered averaging schedule would be a direct next step to test.
  • The reported robustness to heterogeneous device capabilities raises a separate question: whether the approach could be adapted to regulatory settings whose privacy constraints forbid even model sharing, which FDRL-PPO as described relies on.
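
The event-triggered averaging schedule mentioned in the list above could look roughly like the following sketch. It is an editorial illustration, not something the paper describes: the trigger rule (L2 distance from the last shared parameters), the function name `should_share`, and the threshold value are all assumptions.

```python
import math


def should_share(current, last_shared, threshold):
    """Return True when parameters drifted more than `threshold` (L2 norm).

    current, last_shared: flat lists of floats of equal length.
    Assumption: drift since the last shared model is a useful trigger.
    """
    drift = math.sqrt(sum((c - s) ** 2 for c, s in zip(current, last_shared)))
    return drift > threshold


last_shared = [0.0, 0.0]
small_step = [0.1, 0.0]  # little has changed, so sharing is skipped
large_step = [0.3, 0.4]  # parameters moved noticeably, so sharing is triggered

skip = should_share(small_step, last_shared, threshold=0.25)
send = should_share(large_step, last_shared, threshold=0.25)
```

A device that harvested little energy and trained for only a few steps would stay silent under such a rule, which saves communication but also delays its contribution to the shared model.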

Load-bearing premise

That averaging only the learned model parameters is sufficient to overcome the fragmentation of each device's training data caused by time-varying energy availability.

What would settle it

A controlled simulation in which energy-harvesting rates vary so rapidly that each device's local experience set becomes statistically independent of the others'. If FDRL-PPO then performs no better than isolated single-agent PPO, the federated-sharing benefit does not hold.

Figures

Figures reproduced from arXiv: 2605.02705 by Andrea Ortiz, Anja Klein, Oliver Hinz, Patrick Weber, Sumedh J. Dongare, Walid Saad.

Figure 1: A sample MCS system model. Adjacent text spilled from Section II (System Model): We consider an MCS system which consists of an MCSP that sequentially publishes sensing tasks, and a set K of K MUs with EH capabilities. We divide time into T discrete time steps with indices t ∈ {0, 1, …, T − 1} of duration τ int each. In each time step t, the MCSP publishes a set O_t = {O_{n,t}} of tasks where n ∈ {0, 1, …, N − 1} represents the task inde…
Figure 2: Scenario 1: Comparison of different performance metrics for baseline scenario
Figure 3: Training performances of RL-based benchmarks [plot residue: x-axis "Timesteps", y-axis "Average per MU reward", annotations "New MUs join" and "MUs drop-out"]
Figure 5: Scenario 2: Comparison of different performance metrics for varying number of available MUs
Figure 6: Scenario 3: Comparison of different performance metrics for varying number of available tasks per time step
Figure 7: Scenario 4: Comparison of different performance metrics for varying task budget coefficient
Figure 8: Number of tasks per task type in the dataset [plot residue: x-axis "Task types", y-axis "Average budget"]
Figure 10: Scenario 2: Comparison of different performance metrics for varying number of available MUs
Figure 11: Scenario 3: Comparison of different performance metrics for varying number of available tasks per time step
Figure 12: Scenario 4: Comparison of different performance metrics for varying task budget coefficient
Original abstract

Mobile crowdsensing (MCS) is a distributed sensing architecture that utilizes existing sensors on mobile units (MUs) to perform sensing tasks. A mobile crowdsensing platform (MCSP) publishes the sensing tasks and the MUs decide whether to participate in exchange for money. The MCS system is dynamic: the task requirements, the MUs' availability, and their available resources change over time. The MUs aim to find an efficient task participation strategy to maximize their income while the MCSP focuses on maximizing the number of completed tasks. As optimal strategies require perfect non-causal information about the MCS system, which is unavailable in realistic scenarios, the main challenge is to find an efficient task participation strategy for the MUs under incomplete information. To this end, a novel fully decentralized federated deep reinforcement learning algorithm, FDRL-PPO, is proposed. FDRL-PPO enables every MU to learn its own task participation strategy based on its experiences, available resources, and preferences, without relying on perfect non-causal information about the MCS system. To replenish their batteries, the MUs rely on energy harvesting. As a result, their available energy varies over time, leading to varying availability and fragmented learning experiences. To mitigate these challenges, the proposed approach leverages federated learning, enabling MUs to collaboratively improve their models without sharing private raw data like their own experiences. By exchanging only learned models, MUs collectively compensate for individual limitations, and find more scalable, robust, and efficient task participation strategies. Comprehensive evaluations on both synthetic and real-world datasets show that FDRL-PPO consistently outperforms benchmark algorithms in terms of task completion ratio, fairness in task completion, energy consumption, and number of conflicting proposals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FDRL-PPO, a fully decentralized federated deep reinforcement learning algorithm based on proximal policy optimization (PPO) for mobile units (MUs) in mobile crowdsensing (MCS). Each MU independently learns a task participation policy from its local experiences, energy availability, resources, and preferences without requiring perfect non-causal system information. Energy harvesting introduces time-varying availability and fragmented trajectories; federated learning mitigates this by exchanging only model parameters (not raw data) so that MUs collectively improve robustness and scalability. Comprehensive experiments on synthetic and real-world datasets are reported to show consistent gains over benchmarks in task completion ratio, fairness, energy consumption, and number of conflicting proposals.

Significance. If the central claims hold, the work would demonstrate a practical, privacy-preserving route to decentralized RL decision-making in non-stationary, resource-constrained distributed sensing systems. It directly tackles the realistic constraint of incomplete information and energy variability that standard centralized or fully local RL approaches cannot handle, potentially informing deployment of MCS platforms where MUs must balance income, battery life, and system-wide task coverage without global state.

major comments (2)
  1. [§4] §4 (FDRL-PPO algorithm description): the mechanism by which federated aggregation compensates for highly fragmented, non-stationary local trajectories caused by energy harvesting is not specified. No details are given on aggregation weights, number of local PPO epochs per round, or any non-stationarity handling (e.g., experience replay buffers or adaptive learning rates). Because PPO is known to be sensitive to experience distribution mismatch, the load-bearing claim that 'exchanging only learned models' yields robust strategies cannot be verified from the presented material.
  2. [§5] §5 (Performance evaluation): the reported outperformance lacks (i) explicit baseline definitions and hyper-parameters, (ii) ablation isolating the federated component from local PPO, (iii) quantitative tables with means, standard deviations, or statistical tests, and (iv) any analysis of convergence behavior under different energy-harvesting rates. Without these, the abstract's claim of 'consistent outperformance' and robustness under incomplete information remains ungrounded.
minor comments (2)
  1. [Abstract] Abstract: states 'comprehensive evaluations' and 'consistent outperformance' yet supplies no numerical values, baseline names, or dataset sizes; this reduces the abstract's utility as a standalone summary.
  2. [§3] Notation: the state, action, and reward definitions for the per-MU MDP are introduced without a compact table or explicit transition probabilities, making it harder to reproduce the environment.
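
Major comment 2 (iii) asks for means, standard deviations, and statistical tests over independent runs. A minimal sketch of what such reporting could compute is given below, assuming per-run scores are available as lists of floats. Welch's t statistic is used here as one common choice; the referee report does not name a specific test, and the function names are hypothetical.

```python
import math
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) of per-run scores."""
    return statistics.mean(runs), statistics.stdev(runs)


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / se


# Placeholder numbers for illustration only; they are NOT results from the paper.
placeholder_a = [4.0, 5.0, 6.0]
placeholder_b = [1.0, 2.0, 3.0]
mean_a, sd_a = summarize(placeholder_a)
t_stat = welch_t(placeholder_a, placeholder_b)
```

Turning the statistic into a p-value additionally needs the Welch-Satterthwaite degrees of freedom and a t-distribution, which this sketch leaves out.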

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our presentation of FDRL-PPO. We address each major comment point by point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (FDRL-PPO algorithm description): the mechanism by which federated aggregation compensates for highly fragmented, non-stationary local trajectories caused by energy harvesting is not specified. No details are given on aggregation weights, number of local PPO epochs per round, or any non-stationarity handling (e.g., experience replay buffers or adaptive learning rates). Because PPO is known to be sensitive to experience distribution mismatch, the load-bearing claim that 'exchanging only learned models' yields robust strategies cannot be verified from the presented material.

    Authors: We agree that §4 does not provide sufficient implementation-level details on the federated aggregation step and its interaction with non-stationary, fragmented trajectories. The manuscript describes the high-level process of local PPO updates followed by parameter exchange to enable collective learning, but omits explicit specifications such as aggregation weights, the number of local epochs, and explicit non-stationarity mitigations. This limits verifiability of the robustness claim. In the revised manuscript we will expand §4 to specify the aggregation procedure (uniform averaging of parameters from participating MUs), the number of local PPO epochs per round, the maintenance of local experience replay buffers to retain recent trajectories despite energy-induced interruptions, and the role of PPO's clipped surrogate objective in limiting the impact of distribution shifts. These additions will make the compensation mechanism explicit and allow readers to verify how model exchange yields more robust strategies. revision: yes

  2. Referee: [§5] §5 (Performance evaluation): the reported outperformance lacks (i) explicit baseline definitions and hyper-parameters, (ii) ablation isolating the federated component from local PPO, (iii) quantitative tables with means, standard deviations, or statistical tests, and (iv) any analysis of convergence behavior under different energy-harvesting rates. Without these, the abstract's claim of 'consistent outperformance' and robustness under incomplete information remains ungrounded.

    Authors: We concur that the evaluation in §5 would be substantially strengthened by more complete and quantitative reporting. While the manuscript defines the benchmark algorithms and reports comparative results on synthetic and real-world datasets, it does not include a dedicated hyper-parameter table, an explicit ablation isolating the federated aggregation from standalone local PPO, statistical summaries (means, standard deviations, significance tests), or convergence analysis across varying energy-harvesting rates. We will revise §5 to add: (i) a table listing all hyper-parameters and baseline configurations, (ii) a dedicated ablation study comparing FDRL-PPO against its non-federated local-PPO counterpart, (iii) tables reporting mean and standard deviation over multiple independent runs together with statistical tests, and (iv) additional figures and discussion of convergence behavior under low, medium, and high energy-harvesting rates. These changes will directly support the claims of consistent outperformance and robustness under incomplete information. revision: yes
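
The first response leans on PPO's clipped surrogate objective to limit the impact of distribution shifts. For readers unfamiliar with it, here is a minimal sketch of that objective in its standard form from the PPO literature, written with plain Python lists. The clip range of 0.2 is a commonly used default and an assumption here, not a value reported for FDRL-PPO.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    new_logp, old_logp: log-probabilities of the taken actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The new policy doubles the action probability; with a positive advantage
# the gain is capped at (1 + clip_eps) times the advantage.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
```

The cap removes the incentive to move far from the data-collecting policy within one update. Whether that is enough to absorb the shift introduced by averaging models across devices is the open question raised in major comment 1.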

Circularity Check

0 steps flagged

No significant circularity; algorithm proposal evaluated externally

The paper proposes the FDRL-PPO algorithm as a novel method for decentralized federated RL in MCS under incomplete information and energy harvesting constraints. It describes the approach, including model exchange via federated learning to address fragmented experiences, and validates performance through simulations on synthetic and real-world datasets. No derivation chain, equation, or claim reduces by construction to fitted parameters, self-citations, or renamed inputs; the central results are empirical comparisons against benchmarks rather than tautological redefinitions. This matches the default expectation of self-contained algorithmic work with external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions of reinforcement learning (MDP formulation for task decisions) and federated learning (model averaging improves local policies without data sharing). No new physical entities or ad-hoc constants are introduced beyond typical RL hyperparameters.

axioms (2)
  • domain assumption Mobile units' decisions can be modeled as a Markov decision process with states based on local resources and preferences.
    Implicit in the use of RL for task participation strategy learning.
  • domain assumption Federated averaging of models compensates for individual fragmented experiences without introducing bias or instability.
    Central to the claim that collaborative model exchange yields robust strategies.

pith-pipeline@v0.9.0 · 5625 in / 1349 out tokens · 36839 ms · 2026-05-09T15:48:29.267785+00:00 · methodology

