Fairness in two-player zero-sum games with bandit feedback

Pratik Gajane; S Akash

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $\alpha/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (\alpha/m)\mathbf{1} + (1-\alpha)\widetilde{p}$ with $\widetilde{p} \in \Delta_m$, and substituting into the payoff form yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-\alpha)A + \alpha\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most $\alpha(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $\Delta_m$, the bound sharpens to instance-dependent $\widetilde{O}(1/\widetilde{\Delta}(\alpha)^{2})$, where $\widetilde{\Delta}(\alpha)$ is the LP-margin gap.
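The reparametrization identity can be checked numerically. The following is a minimal sketch and is not taken from the paper: the function names (`fair_payoff_matrix`, `bilinear`, `random_simplex_point`, `check_identity`) and the random test instance are hypothetical, chosen only to illustrate that $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A}q$ holds when $p = (\alpha/m)\mathbf{1} + (1-\alpha)\widetilde{p}$ and $\widetilde{A} = (1-\alpha)A + \alpha\mathbf{1}c^{\top}$.

```python
import random


def fair_payoff_matrix(A, alpha):
    """Return A_tilde = (1 - alpha) * A + alpha * 1 c^T, where c holds the column means of A."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)] for i in range(m)]


def bilinear(p, A, q):
    """Compute the bilinear payoff p^T A q."""
    return sum(p[i] * A[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))


def random_simplex_point(k, rng):
    """Draw an arbitrary point of the probability simplex by normalizing positive weights."""
    w = [1.0 - rng.random() for _ in range(k)]  # each weight lies in (0, 1]
    total = sum(w)
    return [x / total for x in w]


def check_identity(m=4, n=3, alpha=0.3, seed=0):
    """Build a random instance (an assumption for illustration) and return (p, lhs, rhs)."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(n)] for _ in range(m)]
    p_tilde = random_simplex_point(m, rng)
    q = random_simplex_point(n, rng)
    # Fair strategy: every action receives probability at least alpha / m.
    p = [alpha / m + (1 - alpha) * x for x in p_tilde]
    lhs = bilinear(p, A, q)                                    # payoff in the fair game on A
    rhs = bilinear(p_tilde, fair_payoff_matrix(A, alpha), q)   # payoff in the standard game on A_tilde
    return p, lhs, rhs


if __name__ == "__main__":
    p, lhs, rhs = check_identity()
    print("min p_i:", min(p), "floor alpha/m:", 0.3 / 4)
    print("difference between the two payoffs:", abs(lhs - rhs))
```

The two payoffs should agree up to floating-point error, because $\widetilde{p}^{\top}\mathbf{1} = 1$ turns the uniform part of $p$ into the column-mean term $\alpha c^{\top}q$. The sketch only checks the algebraic identity; it does not implement $\texttt{Fair-ETC-TPZSG}$ or any regret analysis.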
