High entropy leads to symmetry equivariant policies in Dec-POMDPs
Pith reviewed 2026-05-17 03:48 UTC · model grok-4.3
The pith
Sufficiently high entropy regularization in any Dec-POMDP makes policy gradient flow with tabular softmax converge to the same symmetry-equivariant joint policy from every initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant with respect to all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns.
What carries the argument
The policy gradient flow under tabular softmax parametrization with sufficiently high entropy regularization, which acts as a symmetry-forcing attractor.
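The attractor behaviour described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the Dec-POMDP is reduced to a one-shot, two-agent coordination game with payoff matrix `R`, the gradient flow is approximated by small Euler steps, and the names `train` and `play` are hypothetical helpers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train(R, alpha, theta1, theta2, steps=5000, lr=0.1):
    """Euler-discretised gradient ascent on p^T R q + alpha * (H(p) + H(q)),
    with p = softmax(theta1) and q = softmax(theta2) (tabular softmax)."""
    th1 = np.array(theta1, dtype=float)
    th2 = np.array(theta2, dtype=float)
    for _ in range(steps):
        p, q = softmax(th1), softmax(th2)
        # Gradients with respect to the probabilities themselves.
        g1 = R @ q - alpha * (np.log(p) + 1.0)
        g2 = R.T @ p - alpha * (np.log(q) + 1.0)
        # Chain rule through softmax: d/dtheta_i = p_i * (g_i - p . g).
        th1 += lr * p * (g1 - p @ g1)
        th2 += lr * q * (g2 - q @ g2)
    return softmax(th1), softmax(th2)

def play(p, q, R):
    """Expected common payoff when agent 1 plays p and agent 2 plays q."""
    return float(p @ R @ q)

# Toy game (assumption): both agents are rewarded for picking the same action.
# Relabelling the two actions for both agents is a symmetry of this game.
R = np.eye(2)
for alpha in (0.05, 2.0):
    # Two mirrored initialisations stand in for two random seeds.
    pa, qa = train(R, alpha, [2.0, 0.0], [2.0, 0.0])
    pb, qb = train(R, alpha, [0.0, 2.0], [0.0, 2.0])
    print(f"alpha={alpha}: self-play={play(pa, qa, R):.3f}, "
          f"cross-play={play(pa, qb, R):.3f}")
```

In this toy game, the low coefficient should leave the two runs on opposite conventions (high self-play, low cross-play), while the high coefficient should drive both runs to the uniform policy, which is unchanged by relabelling the actions and therefore gives equal self-play and cross-play returns.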
If this is right
- Policies trained from different random seeds become fully compatible: their cross-play returns equal their self-play returns.
- The entropy coefficient exerts a massive influence on cross-play performance in Hanabi, Overcooked and Yokai.
- Any loss in self-play returns caused by higher entropy can often be recovered by greedifying the policy after training.
- Hyperparameter sweeps for Dec-POMDPs should test far higher entropy coefficients than are currently standard.
- There exist Dec-POMDPs in which the optimal symmetry-equivariant policy cannot be recovered by this route.
Where Pith is reading between the lines
- The same regularization pressure might promote compatible policies in deep-network approximations if the symmetry-attracting effect survives function approximation.
- Explicit symmetry-breaking mechanisms or communication channels could become less necessary once entropy is tuned high enough.
- The result suggests testing whether the same high-entropy regime improves zero-shot coordination in continuous settings beyond the tabular case.
- It would be useful to measure how large the entropy coefficient must be, relative to reward scale, before the equivariant policy becomes the unique attractor.
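The measurement suggested in the last bullet can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's experimental protocol: the game is a one-shot two-action coordination game with payoff `scale * I`, the flow is Euler-discretised, and `run` and `seed_gap` are hypothetical helper names.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run(R, alpha, theta1, theta2, steps=5000, lr=0.1):
    """Euler steps of entropy-regularised policy gradient with tabular softmax."""
    th1 = np.array(theta1, dtype=float)
    th2 = np.array(theta2, dtype=float)
    for _ in range(steps):
        p, q = softmax(th1), softmax(th2)
        g1 = R @ q - alpha * (np.log(p) + 1.0)
        g2 = R.T @ p - alpha * (np.log(q) + 1.0)
        th1 += lr * p * (g1 - p @ g1)
        th2 += lr * q * (g2 - q @ g2)
    return softmax(th1), softmax(th2)

def seed_gap(alpha, scale=1.0):
    """Largest difference in agent 1's action probabilities between two
    mirrored initialisations; near zero means both runs found the same policy."""
    R = scale * np.eye(2)  # reward scale is set by `scale` (assumption)
    pa, _ = run(R, alpha, [2.0, 0.0], [2.0, 0.0])
    pb, _ = run(R, alpha, [0.0, 2.0], [0.0, 2.0])
    return float(np.abs(pa - pb).max())

# Sweep the entropy coefficient as a multiple of the reward scale.
scale = 1.0
for ratio in (0.1, 0.25, 0.4, 0.6, 1.0, 2.0):
    print(f"alpha/scale={ratio}: gap={seed_gap(ratio * scale, scale):.4f}")
```

For this particular toy game, a short calculation suggests that the uniform policy becomes the only stationary point once the coefficient reaches half the reward scale; the corresponding threshold for a real Dec-POMDP would have to be measured or derived separately.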
Load-bearing premise
The assumption that the entropy regularization can be made high enough to turn the symmetry-equivariant policy into the unique global attractor of the gradient flow for every possible Dec-POMDP.
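For intuition about why a large enough coefficient can produce a unique, equivariant optimum, the following worked special case may help. It is a sketch under assumptions that are not the paper's general setting: a one-shot, two-agent common-payoff game with payoff matrix R, analysed in policy coordinates (p, q) on the interior of the simplices rather than in softmax parameters, and it gives a sufficient condition only.

```latex
% Regularised objective for the one-shot special case (illustrative assumption).
J_\alpha(p, q) = p^\top R\, q + \alpha \bigl( H(p) + H(q) \bigr),
\qquad H(p) = -\sum_i p_i \log p_i .

% Hessian in (p, q); the bound uses 1/p_i \ge 1 and 1/q_j \ge 1 on the simplex.
\nabla^2 J_\alpha(p, q)
= \begin{pmatrix} -\alpha\, \mathrm{diag}(1/p) & R \\ R^\top & -\alpha\, \mathrm{diag}(1/q) \end{pmatrix}
\preceq \begin{pmatrix} -\alpha I & R \\ R^\top & -\alpha I \end{pmatrix} .

% The bounding matrix has eigenvalues -\alpha \pm \sigma_k(R) (and possibly -\alpha), hence
\alpha > \sigma_{\max}(R)
\;\Longrightarrow\; J_\alpha \text{ is strictly concave}
\;\Longrightarrow\; \text{the maximiser } (p^*, q^*) \text{ is unique.}

% If g is a symmetry of the game, then J_\alpha(g \cdot (p, q)) = J_\alpha(p, q),
% so g \cdot (p^*, q^*) is also a maximiser, and uniqueness forces
g \cdot (p^*, q^*) = (p^*, q^*) .
```

The last step is the general pattern: once the regularized objective has a single maximizer, any symmetry must map that maximizer to itself. Whether the gradient flow in softmax parameters actually reaches it from every initialization is the separate convergence question the paper's proof is said to address.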
What would settle it
A concrete Dec-POMDP together with an explicit initialization from which the policy gradient flow still converges to a non-equivariant joint policy even as the entropy coefficient is taken arbitrarily large.
Original abstract
We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient flow under tabular softmax parametrization converges from any initialization to the same joint policy, which is equivariant with respect to all symmetries of the Dec-POMDP. This implies that policies from different initializations are fully compatible, with cross-play returns equal to self-play returns. Empirically, independent PPO is evaluated in Hanabi, Overcooked, and Yokai; higher entropy coefficients substantially improve cross-play returns, and post-training greedification often offsets any self-play performance drop, yielding a new SOTA in Hanabi inter-seed cross-play. The authors recommend considering substantially higher entropy coefficients than is typical in Dec-POMDP hyperparameter searches.
Significance. If the stated convergence result holds, the work would provide a theoretical explanation for the emergence of symmetry-equivariant policies under entropy regularization in Dec-POMDPs and a practical guideline for improving cross-play compatibility among independently trained agents. The combination of a claimed general proof for the tabular case with concrete empirical gains (including SOTA cross-play in Hanabi) would be a useful contribution to multi-agent RL, particularly for settings where symmetry and agent compatibility matter.
Major comments (1)
- Abstract: The central claim is a convergence proof that sufficiently high entropy regularization forces policy-gradient flow with tabular softmax to a unique symmetry-equivariant joint policy in arbitrary Dec-POMDPs. No derivation, key lemmas, or definition of the entropy threshold is supplied in the manuscript, so the correctness of this load-bearing theoretical result cannot be verified.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for the positive assessment of its potential significance. We address the major comment below and will incorporate clarifications in the revised version.
- Referee: Abstract: The central claim is a convergence proof that sufficiently high entropy regularization forces policy-gradient flow with tabular softmax to a unique symmetry-equivariant joint policy in arbitrary Dec-POMDPs. No derivation, key lemmas, or definition of the entropy threshold is supplied in the manuscript, so the correctness of this load-bearing theoretical result cannot be verified.
- Authors: We agree that the abstract itself contains no derivation or lemmas, as is conventional for abstracts. The full proof, including the definition of the entropy threshold (the minimum coefficient such that the regularized objective has a unique maximizer that is symmetry-equivariant) and the key lemmas establishing global convergence of the policy-gradient flow under tabular softmax parametrization from any initialization, appears in Section 3 of the main text with supporting technical details in the appendix. We will revise the abstract to include a brief reference to this section and a short statement of the threshold condition. If the referee has particular questions about any step of the argument, we are happy to expand the exposition or add an illustrative example in the revision.
- Revision: yes
Circularity Check
No significant circularity; proof claim is self-contained
The abstract states a convergence result under high entropy regularization for policy gradient flow in Dec-POMDPs, leading to a unique symmetry-equivariant joint policy. No equations, fitted parameters, or self-citations appear in the provided text. The claim is presented as a direct proof rather than a renaming, ansatz smuggling, or reduction to prior self-referential inputs. Without derivation details that reduce to the inputs by construction, the result does not exhibit circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Policy gradient flow under tabular softmax parametrization converges under sufficiently high entropy regularization.
- Domain assumption: Dec-POMDPs possess well-defined symmetries to which policies can be equivariant.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
  Paper passage: We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP.
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relation to the paper passage: unclear
  Paper passage: increasing the entropy coefficient α makes the optimization landscape “more concave,” since entropy is a strictly concave function on the probability simplex
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.