arxiv: 2606.26574 · v1 · pith:SIM7EPJXnew · submitted 2026-06-25 · 💻 cs.LG

Revisiting Action Factorization for Complex Action Spaces

Pith reviewed 2026-06-26 05:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords action factorization · hybrid action spaces · reinforcement learning · PPO · SAC · branching architectures · auto-regressive actions

The pith

VDN-PPO and PPO-MIX outperform other PPO factorizations in hybrid action spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests six factorization methods across PPO, SAC, and DQN on discretized, hybrid, and continuous action spaces in four lightweight environments. It introduces VDN-PPO and PPO-MIX, which use a branching critic to assign credit to multi-headed PPO agents, and releases two new environments to isolate state-dependent action dependencies. Results indicate that these variants beat the other PPO approaches, branching dueling networks strike the best balance between speed and score, auto-regressive factorization yields the single highest performance, and native continuous SAC exceeds discrete and hybrid versions at higher compute cost.

Core claim

Across 220 valid configurations, VDN-PPO and PPO-MIX surpass all other tested PPO factorizations. Branching dueling architectures deliver the most favorable compute-performance trade-off. Auto-regressive action selection produces the overall best scores. Native continuous SAC beats both discrete and hybrid algorithms, though it requires more computation.
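
To make the auto-regressive idea concrete, here is a minimal sketch, not the paper's implementation. It factorizes a hybrid action by the chain rule, p(a_d, a_c | s) = p(a_d | s) · p(a_c | s, a_d): the discrete head is sampled first and the continuous head's mean depends on that choice. The probabilities, means, and function names are hypothetical placeholders standing in for network outputs at a single state.

```python
import math
import random

# Hypothetical stand-ins for network outputs at one state (assumptions, not from the paper).
DISCRETE_PROBS = [0.7, 0.3]        # p(a_d | s)
CONT_MEAN_GIVEN_D = [-1.0, 2.0]    # mean of p(a_c | s, a_d), one per discrete choice
CONT_STD = 0.5                     # shared standard deviation


def gaussian_log_prob(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def sample_autoregressive(rng):
    """Sample the discrete head, then the continuous head conditioned on it."""
    a_d = rng.choices(range(len(DISCRETE_PROBS)), weights=DISCRETE_PROBS)[0]
    a_c = rng.gauss(CONT_MEAN_GIVEN_D[a_d], CONT_STD)
    return a_d, a_c


def joint_log_prob(a_d, a_c):
    """Chain rule: log p(a_d, a_c | s) = log p(a_d | s) + log p(a_c | s, a_d)."""
    return math.log(DISCRETE_PROBS[a_d]) + gaussian_log_prob(
        a_c, CONT_MEAN_GIVEN_D[a_d], CONT_STD
    )


rng = random.Random(0)
a_d, a_c = sample_autoregressive(rng)
log_p = joint_log_prob(a_d, a_c)
```

An independent factorization would instead use a single continuous mean regardless of `a_d`, which is what the conditioning step here avoids, at the cost of sequential sampling.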

What carries the argument

Branching critic that assigns credit across multiple action heads in VDN-PPO and PPO-MIX, combined with factorization schemes that decompose hybrid discrete-continuous actions.
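
The additive mixing that VDN is known for can be sketched as follows. This illustrates only the value-decomposition property (the joint value is a sum of per-head values, so the greedy joint action is the tuple of per-head greedy actions); how the paper carries this into a PPO critic is not specified in the text here, and the per-head numbers are hypothetical.

```python
from itertools import product

# Hypothetical per-head value estimates at one state (assumptions for illustration).
Q_HEADS = [
    [0.1, 0.9, 0.4],   # head 0: three discrete options
    [1.2, -0.3],       # head 1: two discrete options
]


def q_total(joint_action):
    """VDN-style mixing: the joint value is the sum of per-head values."""
    return sum(q[a] for q, a in zip(Q_HEADS, joint_action))


# Greedy action per head, computed independently of the other heads.
per_head_greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in Q_HEADS)

# Greedy joint action by brute-force enumeration of every combination.
joint_greedy = max(product(*(range(len(q)) for q in Q_HEADS)), key=q_total)
```

Under additive mixing the two results coincide, which is why each head can be credited and optimized separately without enumerating the joint space.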

If this is right

  • Branching dueling networks become the default choice when both speed and score matter.
  • Auto-regressive factorization should be preferred when maximum performance is the goal and compute is available.
  • Continuous SAC is the strongest option for purely continuous control despite its cost.
  • New lightweight environments can be used to benchmark future factorization methods before scaling to heavyweight simulators.
  • VDN-style credit assignment extends usefully from value-based to policy-gradient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world systems such as autonomous driving may gain immediate performance by swapping in the branching PPO variants without changing the underlying simulator.
  • The same branching critic idea could be tested inside other on-policy algorithms beyond PPO.
  • The observed cost-performance curves suggest a practical decision rule: start with branching dueling, move to auto-regressive only when extra accuracy justifies the added latency.

Load-bearing premise

The four lightweight environments capture the state-dependent inter-action dependence that appears in real hybrid control tasks.

What would settle it

Re-running the identical 220 configurations on a heavier benchmark such as CARLA or a larger multi-agent task and obtaining reversed performance orderings among the factorization methods would falsify the ranking.

Figures

Figures reproduced from arXiv: 2606.26574 by Sandip Sen, Timothy Flavin.

Figure 1. Lunar-Lander v3: shown top left. CoopPush: particles, boulders, and landmarks; default (bottom left) and independent (bottom center). Hybrid-Shoot: Targets: • Selected: o Shoot Location: o. Platform: Agent (purple), Obstacles (grey).
3.5 Action Embeddings. Latent action embedding methods such as HyAR [14] and action representation learning [2] map hybrid spaces into compact continuous latent spaces. Because no c…
Figure 2. Vectorized reward curves on the Contextual-Decoupler (…
Figure 3. Orange is the correlation between the true state’s active head …
Figure 4. Mean results over all environments, grouped by action datatype (columns) and algorithm family (rows). Note that hybrid …
Figure 5. The discrete and hybrid shoot environments are SAC’s worst performance because 300 action choices combined with a discrete …
Figure 6. All datatype results aggregated by algorithm. SAC performs best with native continuous actions while PPO and DQN perform …
Figure 7. Across action types auto-regressive and independent actions perform best (both because parameter count scales with action …
Figure 8. "Continuous" action types only (11 buckets per action), zoomed in from the joint figure in the paper
Figure 9. "Discrete" action types only, zoomed in from the joint figure in the paper
Figure 10. "Hybrid" action types only, first half discrete, second half continuous, zoomed in from the joint figure in the paper
Figure 11. Overall, Q-PLEX performs the best when taking all action types into account, though only marginally. VDN is close behind …
Figure 12. For continuous actions squashed Gaussian PPO is stable with respect to factorization
Figure 13. For discrete actions, VDN and Q-PLEX perform best. KL-divergence is applied to the joint space here so we hypothesize that …
Figure 14. We are unsure what it is about continuous actions’ inclusion that reduces the impact on factorization
Figure 15. Joint discrete is the only true under-performer but AR-SAC is best on this bench suite. Discrete actions with an entropy …
Figure 16. The best performing algorithm overall on all environments
Figure 17. Only Joint underperforms as in the discussion in Figure …
Figure 18. Auto-regressive SAC in the hybrid space uses standard SAC for continuous dims and branching dueling D-SAC or Discrete-SAC …
Figure 19. Action Types Aggregated for DQN
Figure 20. Action Types Aggregated for PPO. Manuscript submitted to ACM.
Figure 21. Action Types Aggregated for SAC
Figure 22. Non-aggregated results for Continuous Dependent Push.
Figure 23. Non-aggregated results for Continuous Dependent Shoot.
Figure 24. Non-aggregated results for Continuous Independent Shoot.
Figure 25. Non-aggregated results for Continuous Lander.
Figure 26. Non-aggregated results for Continuous Platform.
Figure 27. Non-aggregated results for Discrete Dependent Push.
Figure 28. Non-aggregated results for Discrete Dependent Shoot.
Figure 29. Non-aggregated results for Discrete Independent Shoot.
Figure 30. Non-aggregated results for Discrete Lander.
Figure 31. Non-aggregated results for Discrete Platform.
Figure 32. Non-aggregated results for Hybrid Dependent Push.
Figure 33. Non-aggregated results for Hybrid Dependent Shoot.
Figure 34. Non-aggregated results for Hybrid Independent Shoot.
Figure 35. Non-aggregated results for Hybrid Lander.
Figure 36. Non-aggregated results for Hybrid Platform.
Original abstract

Many real-world control problems involve hybrid discrete-continuous action spaces. For example, steering and signaling in autonomous driving, and aiming and firing in robotics or video-games. Despite real-world hybrid factorization and reinforcement learning framework support for complex action spaces (e.g., Gymnasium, PettingZoo, TorchRL, SeedRL, Mujoco, etc), the default environments within those frameworks often implement uniform action space configurations (LunarLander, Walker2D, Cheetah, SMAC, SUMO, Ant, Atari). Landmark hybrid-action benchmarks (RoboCup 2D HFO, SC2LE, Platform, CARLA, etc) are mostly heavyweight or archival implementations originating from papers which test one or a small number of competing factorization methods on one kind of control. This article provides a cross-sectional study of factorization methods [independent networks, shared encoder, VDN, QPLEX, Joint, Auto-Regressive] on each of three families of algorithms [PPO, SAC, DQN] across three action spaces [discretized, hybrid, continuous] over four lightweight environments [Platform, hybrid-LunarLander, Hybrid-Shoot, CoopPush]. Accounting for some invalid pairings such as joint-continuous, we are left with 220 configurations to analyze each method. We provide two new C++ parallel gymnasium and petting-zoo compliant environments [CoopPush, Hybrid-Shoot] to isolate particular challenges such as state-dependent inter-action dependence. Finally, we introduce VDN-PPO and PPO-MIX which use a branching critic to assign credit to multi-headed PPO. These variants out-perform all other tested PPO factorizations. Our results suggest that branching dueling architectures balance compute and performance most effectively, with Auto-Regressive actions reaching the highest performance overall and native continuous SAC outperforming discrete and hybrid algorithms, albiet both at increased computational cost.
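
One reason the Joint factorization is treated separately from the branching and independent ones is output size. The sketch below assumes every action dimension is discretized into the same number of buckets; the 11-bucket setting is taken from a figure caption in the paper, while the dimension counts and function names are hypothetical.

```python
def joint_outputs(buckets_per_dim, num_dims):
    """A single joint head needs one output per combination of buckets."""
    return buckets_per_dim ** num_dims


def branched_outputs(buckets_per_dim, num_dims):
    """A branched (factorized) policy needs one output per bucket per dimension."""
    return buckets_per_dim * num_dims


BUCKETS = 11  # buckets per action dimension, as in the paper's discretized "continuous" setting
sizes = {d: (joint_outputs(BUCKETS, d), branched_outputs(BUCKETS, d)) for d in (1, 2, 4)}
# sizes[4] == (14641, 44): exponential growth for the joint head, linear for branches
```

The joint head can represent any dependence between dimensions but grows exponentially; factorized heads stay small and need some other mechanism, such as a mixing critic or auto-regressive conditioning, to recover that dependence.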

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a cross-sectional empirical study of action factorization methods (independent networks, shared encoder, VDN, QPLEX, Joint, Auto-Regressive) applied to PPO, SAC, and DQN across discretized, hybrid, and continuous action spaces in four lightweight environments (Platform, hybrid-LunarLander, Hybrid-Shoot, CoopPush). It introduces two new Gymnasium/PettingZoo-compliant environments (CoopPush, Hybrid-Shoot) and proposes VDN-PPO and PPO-MIX (branching-critic variants of PPO), claiming these outperform other tested PPO factorizations. Results indicate branching dueling architectures balance compute/performance, Auto-Regressive reaches highest performance overall, and native continuous SAC outperforms discrete/hybrid variants (at higher cost), based on 220 configurations.

Significance. If the empirical claims hold after addressing verification gaps, the work offers a useful broad benchmark for hybrid action spaces, new environments targeting state-dependent inter-action dependence, and the VDN-PPO/PPO-MIX variants as practical contributions. The scale of 220 configurations and the explicit comparison across algorithm families give practitioners trade-off insights not available in single-method papers.

major comments (2)
  1. [Abstract] Abstract: The central performance claims (VDN-PPO and PPO-MIX outperforming other PPO factorizations; Auto-Regressive highest overall; native continuous SAC outperforming) are presented without any reference to statistical tests, error bars, number of random seeds, or hyperparameter search details. This directly affects verifiability of the outperformance assertions that form the paper's headline results.
  2. [Abstract] Abstract: The new environments are introduced 'to isolate particular challenges such as state-dependent inter-action dependence,' yet the text supplies no mechanism, state-feature definition, or example demonstrating how the coupling between discrete and continuous actions is made conditional on state (as opposed to fixed or state-independent hybrids like standard LunarLander). This is load-bearing for the claim that the 220 configurations test the motivating regime rather than uniform benchmarks.
minor comments (1)
  1. [Abstract] Abstract contains a typo: 'albiet' should be 'albeit'.
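
The kind of per-seed comparison the first major comment asks for can be sketched with a paired t statistic. This is a minimal illustration using only the standard library; the scores are made-up placeholders, not results from the paper, and the seed count is arbitrary.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-seed score differences (df = n - 1)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Made-up placeholder returns, one per seed; NOT results from the paper.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [0.5, 1.0, 2.5, 3.0, 4.5]
t_stat = paired_t_statistic(method_a, method_b)
```

Pairing by seed removes variance shared between the two methods, but with only a handful of seeds the test has little power, so reporting the seed count alongside any significance claim matters.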

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract. We address each point below and will make revisions to improve verifiability and clarity.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claims (VDN-PPO and PPO-MIX outperforming other PPO factorizations; Auto-Regressive highest overall; native continuous SAC outperforming) are presented without any reference to statistical tests, error bars, number of random seeds, or hyperparameter search details. This directly affects verifiability of the outperformance assertions that form the paper's headline results.

    Authors: We agree the abstract should reference these details for self-containment. The experiments use 5 random seeds per configuration with standard deviation error bars in all plots and tables; hyperparameter grids are documented in the appendix. We will revise the abstract to note the seed count, error bars, and that significance testing (paired t-tests) supports the reported outperformance where claimed. revision: yes

  2. Referee: [Abstract] Abstract: The new environments are introduced 'to isolate particular challenges such as state-dependent inter-action dependence,' yet the text supplies no mechanism, state-feature definition, or example demonstrating how the coupling between discrete and continuous actions is made conditional on state (as opposed to fixed or state-independent hybrids like standard LunarLander). This is load-bearing for the claim that the 220 configurations test the motivating regime rather than uniform benchmarks.

    Authors: The environments were constructed with explicit state-dependent mechanisms (e.g., in CoopPush the continuous push force modulates the discrete grasp/release decision via a state feature combining relative position and velocity thresholds; Hybrid-Shoot conditions the discrete fire action on continuous aim angle and a state-derived target proximity scalar). We will add a dedicated subsection with formal definitions, pseudocode for the coupling, and concrete state examples to the revised manuscript. revision: yes
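
State-dependent inter-action dependence can be illustrated with a toy reward. This is a hypothetical example, not the mechanics of CoopPush or Hybrid-Shoot: a state flag decides whether the best continuous value depends on the discrete choice.

```python
def toy_reward(coupled, discrete_choice, continuous_value):
    """Reward is highest (zero) when continuous_value hits a target in [0, 1].

    When `coupled` is True the target depends on the discrete choice, so the
    two action components must be chosen together. When it is False the
    target is fixed and the components can be optimized independently.
    """
    if coupled:
        target = 0.2 if discrete_choice == 0 else 0.8
    else:
        target = 0.5
    return -abs(continuous_value - target)
```

In uncoupled states `toy_reward(False, d, 0.5)` is optimal for either `d`; in coupled states a policy whose continuous head ignores the discrete choice cannot be optimal for both, which is the regime where factorization choices should separate.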

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct measurements

Full rationale

The paper conducts a cross-sectional empirical study of factorization methods across algorithms and action spaces on four environments, reporting performance as direct experimental outcomes. No derivations, fitted parameters renamed as predictions, or self-referential equations are present. New environments (CoopPush, Hybrid-Shoot) and variants (VDN-PPO, PPO-MIX) are introduced and evaluated via measurements, not defined in terms of the results themselves. Self-citations, if any, are not load-bearing for central claims. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; abstract lists no mathematical derivations, fitted constants, or new postulated entities. No free parameters, axioms, or invented entities are extractable from the given text.

Reference graph

Works this paper leans on

47 extracted references · 1 canonical work pages

  1. [1]

    Gustavo Campos, Nael H El-Farra, and Ahmet Palazoglu. 2022. Soft actor-critic deep reinforcement learning with hybrid mixed-integer actions for demand responsive scheduling of energy systems. Industrial & Engineering Chemistry Research 61, 24 (2022), 8443–8461

  2. [2]

    Yash Chandak, Georgios Theocharous, James Kostas, Scott Jordan, and Philip Thomas. 2019. Learning action representations for reinforcement learning. In International conference on machine learning. PMLR, 941–950

  3. [3]

    Shaotao Chen, Xihe Qiu, Xiaoyu Tan, Zhijun Fang, and Yaochu Jin. 2022. A model-based hybrid soft actor-critic deep reinforcement learning algorithm for optimal ventilator settings. Information sciences 611 (2022), 47–64

  4. [4]

    Olivier Delalleau, Maxim Peter, Eloi Alonso, and Adrien Logut. 2019. Discrete and continuous action representation for practical rl in video games. arXiv preprint arXiv:1912.11077 (2019)

  5. [5]

    Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement learning for continuous control. In International conference on machine learning. PMLR, 1329–1338

  6. [6]

    Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob Foerster, and Shimon Whiteson. 2023. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 37567–37593

  7. [7]

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al

  8. [8]

    Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning. PMLR, 1407–1416

  9. [9]

    Zhou Fan, Rui Su, Weinan Zhang, and Yong Yu. 2019. Hybrid actor-critic reinforcement learning in parameterized action space. arXiv preprint arXiv:1903.01344 (2019)

  10. [10]

    Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

  11. [11]

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 1861–1870

  12. [12]

    Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33, 6 (2019), 750–797

  13. [13]

    Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)

  14. [14]

    Dmytro Korenkevych, A Rupam Mahmood, Gautham Vasan, and James Bergstra. 2019. Autoregressive policies for continuous control deep reinforcement learning. arXiv preprint arXiv:1903.11524 (2019)

  15. [15]

    Boyan Li, Hongyao Tang, Yan Zheng, Jianye Hao, Pengyi Li, Zhen Wang, Zhaopeng Meng, and Li Wang. 2021. Hyar: Addressing discrete-continuous action reinforcement learning via hybrid action representation. arXiv preprint arXiv:2109.05490 (2021)

  16. [16]

    Chuming Li, Jie Liu, Yinmin Zhang, Yuhong Wei, Yazhe Niu, Yaodong Yang, Yu Liu, and Wanli Ouyang. 2023. Ace: Cooperative multi-agent q-learning with bidirectional action-dependency. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37. 8536–8544

  17. [17]

    Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. 2023. Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation. In International Conference on Machine Learning. PMLR, 19440–19459

  18. [18]

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

  19. [19]

    Zemin Eitan Liu, Yanfei Li, Quan Zhou, Yong Li, Bin Shuai, Hongming Xu, Min Hua, Guikun Tan, and Lubing Xu. 2024. Deep Reinforcement Learning-Based Energy Management for Heavy Duty HEV Considering Discrete-Continuous Hybrid Action Space. IEEE Transactions on Transportation Electrification 10, 4 (2024), 9864–9876. doi:10.1109/TTE.2024.3363650

  20. [20]

    Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems 30 (2017)

  21. [21]

    Warwick Masson, Pravesh Ranchod, and George Konidaris. 2016. Reinforcement learning with parameterized actions. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30

  22. [22]

    Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. 2012. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review 27, 1 (2012), 1–31

  23. [23]

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

  24. [24]

    Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. 2021. Facmac: Factored multi-agent centralised policy gradients. Advances in neural information processing systems 34 (2021), 12208–12221

  25. [25]

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2020. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21, 178 (2020), 1–51

  26. [26]

    Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In International conference on machine learning. PMLR, 1530–1538

  27. [27]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  28. [28]

    Tim Seyde, Igor Gilitschenski, Wilko Schwarting, Bartolomeo Stellato, Martin Riedmiller, Markus Wulfmeier, and Daniela Rus. 2021. Is bang-bang control all you need? solving continuous control with bernoulli policies. Advances in Neural Information Processing Systems 34 (2021), 27209–27221

  29. [29]

    Satinder P Singh and Richard S Sutton. 1996. Reinforcement learning with replacing eligibility traces. Machine learning 22, 1 (1996), 123–158

  30. [30]

    Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. 2019. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International conference on machine learning. PMLR, 5887–5896

  31. [31]

    Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. 2017. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017)

  32. [32]

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems 12 (1999)

  33. [33]

    Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning. 330–337

  34. [34]

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. 2018. Deepmind control suite. arXiv preprint arXiv:1801.00690 (2018)

  35. [35]

    Arash Tavakoli, Fabio Pardo, and Petar Kormushev. 2018. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32

  36. [36]

    J Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sullivan, Luis S Santos, Clemens Dieffendahl, Caroline Horsch, Rodrigo Perez-Vicente, et al. 2021. Pettingzoo: Gym for multi-agent reinforcement learning. Advances in Neural Information Processing Systems 34 (2021), 15032–15043

  37. [37]

    Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 5026–5033

  38. [38]

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. 2024. Gymnasium: A Standard Interface for Reinforcement Learning Environments. arXiv preprint arXiv:2407.17032 (2024)

  39. [39]

    Nino Vieillard, Olivier Pietquin, and Matthieu Geist. 2020. Munchausen reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 4235–4246

  40. [40]

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354

  41. [41]

    Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. 2020. Qplex: Duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062 (2020)

  42. [42]

    Ze Wang, Ni Li, and Guanghong Gong. 2025. VDMPO: Policy optimization for cooperative multi-agent reinforcement learning based on joint value decomposition. Neurocomputing (2025), 131193

  43. [43]

    Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. 2016. Dueling network architectures for deep reinforcement learning. In International conference on machine learning. PMLR, 1995–2003

  44. [44]

    Jiayi Weng, Min Lin, Shengyi Huang, Bo Liu, Denys Makoviichuk, Viktor Makoviychuk, Zichen Liu, Yufan Song, Ting Luo, Yukun Jiang, et al. 2022. Envpool: A highly parallel reinforcement learning environment execution engine. Advances in Neural Information Processing Systems 35 (2022), 22409–22421

  45. [45]

    Jiechao Xiong, Qing Wang, Zhuoran Yang, Peng Sun, Lei Han, Yang Zheng, Haobo Fu, Tong Zhang, Ji Liu, and Han Liu. 2018. Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394 (2018)

  46. [46]

    Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. 2021. Mastering atari games with limited data. Advances in neural information processing systems 34 (2021), 25476–25488

  47. [47]

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in neural information processing systems 35 (2022), 24611–24624