High entropy leads to symmetry equivariant policies in Dec-POMDPs

Andreas Bulling; Constantin Ruhdorfer; Jakob Foerster; Johannes Forkel

arxiv: 2511.22581 · v4 · submitted 2025-11-27 · 💻 cs.LG · cs.MA

Pith reviewed 2026-05-17 03:48 UTC · model grok-4.3

keywords: Dec-POMDP, entropy regularization, policy gradient, symmetry equivariance, multi-agent reinforcement learning, tabular softmax, cross-play performance, independent PPO

The pith

Sufficiently high entropy regularization in any Dec-POMDP makes policy gradient flow with tabular softmax converge to the same symmetry-equivariant joint policy from every initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in decentralized partially observable Markov decision processes, adding enough entropy regularization to the objective guarantees that gradient flow under a tabular softmax policy always reaches one particular joint policy no matter where training starts. That policy respects every symmetry present in the Dec-POMDP, so agents behave identically under symmetric observations or actions. A sympathetic reader cares because the result directly addresses the practical failure of independently trained agents to cooperate when they are later paired together. The authors also report that raising the entropy coefficient dramatically improves cross-play scores in Hanabi, Overcooked and Yokai, and that any drop in self-play performance can often be recovered by simply taking the greedy action after training.

Core claim

In any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant with respect to all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns.
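The claim can be illustrated on a toy problem. The sketch below is not from the paper: it assumes a hypothetical one-step, two-agent coordination game (reward 1 when both agents pick the same action), approximates the gradient flow with plain Euler steps, and uses illustrative names (`REWARD`, `train`, `INIT_A`, `INIT_B`) and coefficient values chosen only for this example.

```python
import math

# Hypothetical one-step coordination game (an assumption, not from the paper):
# both agents receive reward 1 if they pick the same action, else 0.
REWARD = [[1.0, 0.0], [0.0, 1.0]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_return(pi1, pi2):
    return sum(pi1[a] * pi2[b] * REWARD[a][b] for a in range(2) for b in range(2))

def logit_gradient(pi_self, q_values, alpha):
    # Gradient of E[R] + alpha * H(pi_self) w.r.t. tabular softmax logits.
    g = [q - alpha * math.log(p) for q, p in zip(q_values, pi_self)]
    baseline = sum(p * gi for p, gi in zip(pi_self, g))
    return [p * (gi - baseline) for p, gi in zip(pi_self, g)]

def train(theta1, theta2, alpha, steps=5000, lr=0.5):
    # Euler discretisation of the entropy-regularized policy gradient flow.
    theta1, theta2 = list(theta1), list(theta2)
    for _ in range(steps):
        pi1, pi2 = softmax(theta1), softmax(theta2)
        q1 = [sum(pi2[b] * REWARD[a][b] for b in range(2)) for a in range(2)]
        q2 = [sum(pi1[a] * REWARD[a][b] for a in range(2)) for b in range(2)]
        grad1 = logit_gradient(pi1, q1, alpha)
        grad2 = logit_gradient(pi2, q2, alpha)
        theta1 = [t + lr * g for t, g in zip(theta1, grad1)]
        theta2 = [t + lr * g for t, g in zip(theta2, grad2)]
    return softmax(theta1), softmax(theta2)

# Two initializations that lean towards opposite conventions.
INIT_A = ([1.0, -1.0], [1.0, -1.0])
INIT_B = ([-1.0, 1.0], [-1.0, 1.0])

def self_and_cross_play(alpha):
    a1, a2 = train(*INIT_A, alpha)
    b1, b2 = train(*INIT_B, alpha)
    self_play = expected_return(a1, a2)   # agents trained together
    cross_play = expected_return(a1, b2)  # agents from different runs
    return self_play, cross_play

for alpha in (0.05, 1.0):
    sp, xp = self_and_cross_play(alpha)
    print(f"alpha={alpha}: self-play={sp:.3f}, cross-play={xp:.3f}")
```

In this toy setting the low coefficient lets the two runs settle on opposite conventions, so cross-play collapses while self-play stays high. The high coefficient pulls both runs to the uniform policy, which is invariant under relabeling the two actions, so cross-play and self-play coincide, at the cost of a lower self-play return.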

What carries the argument

The policy gradient flow under tabular softmax parametrization with sufficiently high entropy regularization, which acts as a symmetry-forcing attractor.

If this is right

Policies trained from different random seeds become fully compatible: their cross-play returns equal their self-play returns.
The entropy coefficient exerts a massive influence on cross-play performance in Hanabi, Overcooked and Yokai.
Any loss in self-play returns caused by higher entropy can often be recovered by greedifying the policy after training.
Hyperparameter sweeps for Dec-POMDPs should test far higher entropy coefficients than are currently standard.
There exist Dec-POMDPs in which the optimal symmetry-equivariant policy cannot be recovered by this route.
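The greedification step mentioned in the list above can be sketched for a tabular policy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `greedify` helper, the dictionary layout of the policy, the tie-breaking rule, and the toy reward table are all hypothetical.

```python
def greedify(policy):
    """Turn each row of a tabular stochastic policy into a one-hot argmax row.

    `policy` maps an observation label to a list of action probabilities.
    Ties go to the lowest action index, an arbitrary choice made here.
    """
    greedy = {}
    for obs, probs in policy.items():
        best = max(range(len(probs)), key=lambda a: probs[a])
        greedy[obs] = [1.0 if a == best else 0.0 for a in range(len(probs))]
    return greedy

# Hypothetical one-step game: reward 1 only if both agents pick action 0.
REWARD = [[1.0, 0.0], [0.0, 0.0]]

def expected_return(pi1, pi2):
    return sum(pi1[a] * pi2[b] * REWARD[a][b]
               for a in range(len(pi1)) for b in range(len(pi2)))

# A high-entropy policy that still prefers the rewarded action.
soft = {"start": [0.7, 0.3]}
greedy = greedify(soft)

soft_return = expected_return(soft["start"], soft["start"])
greedy_return = expected_return(greedy["start"], greedy["start"])
print(soft_return, greedy_return)
```

Here the stochastic policy loses return to exploration noise and the greedy version recovers it. When a row is exactly uniform the argmax carries no information, so the result depends entirely on the tie-breaking rule.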

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularization pressure might promote compatible policies in deep-network approximations if the symmetry-attracting effect survives function approximation.
Explicit symmetry-breaking mechanisms or communication channels could become less necessary once entropy is tuned high enough.
The result suggests testing whether the same high-entropy regime improves zero-shot coordination in continuous or partially observable settings beyond the tabular case.
It would be useful to measure how large the entropy coefficient must be, relative to reward scale, before the equivariant policy becomes the unique attractor.
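As a worked illustration of the last point, consider a hypothetical one-step game that is not taken from the paper: two agents each pick one of two actions and share a reward c > 0 when the actions match. The symbols below are assumptions of this sketch: p and q are the two agents' probabilities of action 0, x = 2p - 1, y = 2q - 1, and α is the entropy coefficient.

```latex
\begin{align*}
J(x,y) &= \frac{c}{2}\,(1 + xy) + \alpha\,\bigl(H(p) + H(q)\bigr),
  \qquad H(p) = -p\log p - (1-p)\log(1-p), \\
\frac{\partial J}{\partial x} &= \frac{c}{2}\,y - \alpha\,\operatorname{artanh}(x) = 0,
  \qquad
\frac{\partial J}{\partial y} = \frac{c}{2}\,x - \alpha\,\operatorname{artanh}(y) = 0 .
\end{align*}
% Because |artanh(t)| > |t| for every t != 0, the two conditions give
%   |y| = (2 alpha / c) |artanh(x)|  and  |x| = (2 alpha / c) |artanh(y)|,
% which for 2 alpha >= c can only hold simultaneously at x = y = 0.
```

In this toy case the uniform policy is the only stationary point of the regularized objective once α ≥ c/2, while for α < c/2 additional stationary points with x = y ≠ 0 appear. The threshold therefore scales linearly with the reward scale here. Whether anything similar holds in larger Dec-POMDPs is not something this example can show.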

Load-bearing premise

The assumption that the entropy regularization can be made high enough to turn the symmetry-equivariant policy into the unique global attractor of the gradient flow for every possible Dec-POMDP.

What would settle it

A concrete Dec-POMDP together with an explicit initialization from which the policy gradient flow still converges to a non-equivariant joint policy even as the entropy coefficient is taken arbitrarily large.

Original abstract

We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

High entropy regularization can force policy gradient flows in Dec-POMDPs to a unique symmetry-equivariant joint policy that fixes cross-play incompatibility.

The core result here is that sufficiently high entropy regularization in tabular policy gradient dynamics for any Dec-POMDP drives convergence to the same joint policy from any initialization, and that policy is equivariant under the environment's symmetries. This directly gives cross-play returns that match self-play returns, which is the practical payoff for cooperative settings like Hanabi where independently trained agents often fail to coordinate. The paper states a convergence proof for the tabular softmax case and then shows through independent PPO runs in Hanabi, Overcooked, and Yokai that raising the entropy coefficient lifts cross-play scores substantially. In Hanabi they report a new inter-seed cross-play SOTA after adding post-training greedification to offset the self-play drop.

The explicit link between entropy strength and guaranteed equivariance plus compatibility looks new relative to the MARL work they cite, and the experiments make a clear case that the entropy coefficient deserves more attention in hyperparameter sweeps than is common. The theory is limited to tabular softmax parametrization, so the deep-network results rest on the extra greedification step, which is presented as a practical fix rather than something the proof covers. The paper notes that some Dec-POMDPs cannot reach the optimal equivariant policy this way, but does not explore how often that happens or what alternatives exist. Without the full derivation it is hard to judge how large the entropy coefficient must be in practice or whether the convergence remains robust when function approximation is introduced.

This is aimed at people working on cooperative multi-agent RL with partial observability and symmetries, especially those who care about cross-play or symmetry-aware training. It has a concrete theoretical claim plus empirical evidence that is strong enough to deserve a serious referee, even if the deep RL side will need tighter controls and the proof will get close scrutiny.

Referee Report

1 major / 0 minor

Summary. The paper proves that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient flow under tabular softmax parametrization converges from any initialization to the same joint policy, which is equivariant with respect to all symmetries of the Dec-POMDP. This implies that policies from different initializations are fully compatible, with cross-play returns equal to self-play returns. Empirically, independent PPO is evaluated in Hanabi, Overcooked, and Yokai; higher entropy coefficients substantially improve cross-play returns, and post-training greedification often offsets any self-play performance drop, yielding a new SOTA in Hanabi inter-seed cross-play. The authors recommend considering substantially higher entropy coefficients than is typical in Dec-POMDP hyperparameter searches.

Significance. If the stated convergence result holds, the work would provide a theoretical explanation for the emergence of symmetry-equivariant policies under entropy regularization in Dec-POMDPs and a practical guideline for improving cross-play compatibility among independently trained agents. The combination of a claimed general proof for the tabular case with concrete empirical gains (including SOTA cross-play in Hanabi) would be a useful contribution to multi-agent RL, particularly for settings where symmetry and agent compatibility matter.

major comments (1)

Abstract: The central claim is a convergence proof that sufficiently high entropy regularization forces policy-gradient flow with tabular softmax to a unique symmetry-equivariant joint policy in arbitrary Dec-POMDPs. No derivation, key lemmas, or definition of the entropy threshold is supplied in the manuscript, so the correctness of this load-bearing theoretical result cannot be verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive assessment of its potential significance. We address the major comment below and will incorporate clarifications in the revised version.

Referee: [—] Abstract: The central claim is a convergence proof that sufficiently high entropy regularization forces policy-gradient flow with tabular softmax to a unique symmetry-equivariant joint policy in arbitrary Dec-POMDPs. No derivation, key lemmas, or definition of the entropy threshold is supplied in the manuscript, so the correctness of this load-bearing theoretical result cannot be verified.

Authors: We agree that the abstract itself contains no derivation or lemmas, as is conventional for abstracts. The full proof, including the definition of the entropy threshold (the minimum coefficient such that the regularized objective has a unique maximizer that is symmetry-equivariant) and the key lemmas establishing global convergence of the policy-gradient flow under tabular softmax parametrization from any initialization, appears in Section 3 of the main text with supporting technical details in the appendix. We will revise the abstract to include a brief reference to this section and a short statement of the threshold condition. If the referee has particular questions about any step of the argument, we are happy to expand the exposition or add an illustrative example in the revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; proof claim is self-contained

The abstract states a convergence result under high entropy regularization for policy gradient flow in Dec-POMDPs, leading to a unique symmetry-equivariant joint policy. No equations, fitted parameters, or self-citations appear in the provided text. The claim is presented as a direct proof rather than a renaming, ansatz smuggling, or reduction to prior self-referential inputs. Without derivation details that reduce to the inputs by construction, the result does not exhibit circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; ledger inferred from stated conditions. The proof relies on standard properties of entropy-regularized policy gradients and the existence of symmetries in the Dec-POMDP.

axioms (2)

domain assumption: Policy gradient flow under tabular softmax parametrization converges under sufficiently high entropy regularization.
Central to the stated proof for any Dec-POMDP.
domain assumption: Dec-POMDPs possess well-defined symmetries to which policies can be equivariant.
Invoked when claiming the converged policy respects all symmetries.

pith-pipeline@v0.9.0 · 5491 in / 1182 out tokens · 62703 ms · 2026-05-17T03:48:27.835788+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

increasing the entropy coefficient α makes the optimization landscape “more concave,” since entropy is a strictly concave function on the probability simplex

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.