arxiv: 2604.26588 · v2 · submitted 2026-04-29 · 🧮 math.OC

Median-of-Means for Nash Equilibrium Seeking in Heavy-Tailed Games

Chao Sun , Huiming Zhang , Bo Chen , Jianzheng Wang , Zheming Wang , Li Yu This is my paper

Pith reviewed 2026-05-07 13:31 UTC · model grok-4.3

classification 🧮 math.OC

keywords Nash equilibrium seekingmedian-of-meansheavy-tailed noisestochastic approximationalmost sure convergencerobust estimationbias correctiongame theory

0 comments

The pith

Median-of-means estimators make Nash-seeking algorithms converge almost surely despite heavy-tailed gradient noise with only finite moments above one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a robust algorithm for finding Nash equilibria in games where players receive noisy gradient information with heavy tails. By applying the median-of-means technique to batch the noisy samples, average within batches, and take the median, the method produces gradient estimates that tolerate infinite variance and up to half corrupted samples. The authors prove that the resulting stochastic approximation process converges almost surely to the equilibrium and derive the rate of that convergence. They also add a bias-correction step to handle asymmetric noise distributions. This matters because many practical multi-agent systems encounter non-Gaussian noise that breaks standard methods requiring finite variance.

Core claim

By replacing the sample-average gradient with a median-of-means estimator in the continuous-time or discrete-time Nash-seeking dynamics, the algorithm achieves almost sure convergence to the Nash equilibrium set together with an explicit almost sure convergence rate when the pseudo-gradient satisfies monotonicity and the noise has a finite δ-moment for δ in (1,2]. An additional online bias-correction procedure removes the asymptotic bias induced by asymmetric noise.

What carries the argument

The median-of-means estimator applied to batches of the observed pseudo-gradient, which partitions samples into blocks, computes the mean of each block, and returns the median of the block means to produce a robust direction update.

If this is right

Almost sure convergence holds without any requirement that the noise possess finite second moments.
The estimator needs no user-chosen clipping threshold, unlike gradient-clipping approaches.
Up to half the gradient samples can be arbitrarily corrupted while consistency is retained.
The online bias-correction step restores asymptotic unbiasedness for asymmetric heavy-tailed distributions.
The same replacement works for both strongly monotone and merely monotone games under the stated assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The identical median-of-means substitution could be dropped into other stochastic approximation schemes such as non-convex stochastic gradient descent.
When moments higher than δ exist, further rate gains might be obtained by pairing median-of-means with variance-reduction methods.
In distributed systems that face possible adversarial gradient injection, the built-in breakdown point supplies robustness without separate attack detection.
Numerical checks on empirical data from sensor networks or financial markets that exhibit power-law tails would reveal whether the predicted rates materialize outside synthetic examples.

Load-bearing premise

The game pseudo-gradient must satisfy monotonicity or strong monotonicity, and the noise must have a finite moment of order strictly greater than one.

What would settle it

Simulate the algorithm on a bilinear zero-sum game driven by stable noise with index 1.5; the sequence of actions should converge almost surely to the known equilibrium, while the identical algorithm that uses ordinary sample averages should fail to converge.

read the original abstract

This paper studies the Nash equilibrium seeking problem for stochastic games under heavy-tailed noise. The gradient noise is considered to have a finite $\delta$-th moment ($1<\delta\le 2$), which generalizes the Gaussian noise and covers cases with infinite variance. In this work, we employ the classic method Median-of-Means (MoM) in robust estimation. MoM works by dividing samples into blocks, taking the average of each block, and then taking the median of these block averages, achieving a breakdown point of up to $1/2$. This makes the final estimate reliable even when some samples are very noisy or wrong, and thus is effective to handle the heavy-tailed noise. The method also naturally defends against malicious gradient attacks. Compared with gradient clipping, which is the most popular method to deal with the heavy-tailed noise, MoM requires no preset clipping threshold and is insensitive to the tail behavior of the noise. Under standard assumptions, we prove the almost sure convergence of the algorithm and derive its almost sure convergence rate. To address the systematic bias caused by asymmetric noise, we further design an online bias correction strategy. Simulation results show the effectiveness and efficiency of the proposed algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adapts median-of-means to online Nash seeking under heavy-tailed noise with a bias correction and claims almost-sure convergence, but the dependence in successive gradient blocks needs explicit handling in the proofs.

read the letter

The paper's main contribution is taking the median-of-means estimator, splitting samples into blocks and taking the median of block averages, and plugging it into a stochastic approximation scheme for finding Nash equilibria when the pseudo-gradient noise has only finite moments between 1 and 2. They add an online bias correction step to offset asymmetry in the noise distribution. Under standard monotonicity or strong convexity assumptions on the game, they state almost-sure convergence and a rate, plus some simulation checks for effectiveness against heavy tails and attacks.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a Median-of-Means (MoM) estimator for stochastic gradient approximation in Nash equilibrium seeking problems subject to heavy-tailed noise with finite δ-th moment (1 < δ ≤ 2). It claims to prove almost sure convergence of the resulting algorithm and its rate under standard monotonicity/Lipschitz assumptions on the pseudo-gradient, introduces an online bias-correction step for asymmetric noise, and validates the approach via simulations.

Significance. If the convergence analysis successfully extends the MoM concentration properties to the dependent, non-stationary online setting, the result would offer a robust, threshold-free alternative to gradient clipping for heavy-tailed stochastic games, with natural resistance to adversarial attacks. This could impact distributed optimization and learning in non-Gaussian environments.

major comments (2)

[Convergence Analysis] In the convergence analysis (Theorem on a.s. convergence and rate), the proof applies standard MoM tail bounds to block averages of gradients evaluated at successive iterates x_k. These bounds require i.i.d. samples, but the online recursion makes the block averages dependent and non-stationary under only monotonicity/Lipschitz assumptions on the pseudo-gradient. An explicit argument is needed showing how the step-size schedule absorbs this dependence (e.g., via mixing or uniform integrability) to justify the supermartingale/Borel-Cantelli step for a.s. convergence.
[Bias Correction Strategy] The online bias-correction recursion is introduced to handle asymmetric noise, but its coupling to past MoM estimates alters the error dynamics. The manuscript must derive the modified convergence rate explicitly and confirm that the a.s. property is preserved, as this step is load-bearing for the claim that the algorithm works under asymmetric heavy-tailed noise.

minor comments (2)

[Assumptions] A dedicated subsection enumerating all assumptions (moment condition, monotonicity, step-size requirements, block size) would clarify the 'standard assumptions' referenced in the abstract and proofs.
[Algorithm] In the algorithm description, specify how the MoM block size is chosen relative to the iteration index and step-size schedule to ensure the claimed rate holds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation and rigor of the convergence arguments.

read point-by-point responses

Referee: [Convergence Analysis] In the convergence analysis (Theorem on a.s. convergence and rate), the proof applies standard MoM tail bounds to block averages of gradients evaluated at successive iterates x_k. These bounds require i.i.d. samples, but the online recursion makes the block averages dependent and non-stationary under only monotonicity/Lipschitz assumptions on the pseudo-gradient. An explicit argument is needed showing how the step-size schedule absorbs this dependence (e.g., via mixing or uniform integrability) to justify the supermartingale/Borel-Cantelli step for a.s. convergence.

Authors: We thank the referee for this important observation. The current proof sketch invokes the MoM concentration on block averages and proceeds via a supermartingale argument, relying on the fact that the diminishing step-size α_k = Θ(k^{-β}) (with β chosen in (0,1)) forces the iterates to change slowly enough that the dependence within each block is controlled by the Lipschitz constant of the pseudo-gradient. The resulting perturbation to the tail bounds is summable, permitting the Borel-Cantelli step. Nevertheless, we agree that an explicit quantification of this dependence (via a mixing or uniform-integrability lemma) is not fully spelled out. In the revision we will insert a supporting lemma that bounds the total variation of the block averages relative to a stationary reference using the step-size decay and the monotonicity/Lipschitz assumptions, thereby justifying the direct application of the i.i.d.-style MoM bounds up to a negligible error term. revision: yes
Referee: [Bias Correction Strategy] The online bias-correction recursion is introduced to handle asymmetric noise, but its coupling to past MoM estimates alters the error dynamics. The manuscript must derive the modified convergence rate explicitly and confirm that the a.s. property is preserved, as this step is load-bearing for the claim that the algorithm works under asymmetric heavy-tailed noise.

Authors: We appreciate the referee’s emphasis on this coupling. The bias-correction recursion is constructed so that its own error is of strictly lower order than the main MoM-driven term; consequently the additional perturbation remains summable and does not destroy the supermartingale property. We have verified that the almost-sure convergence and the rate (up to a multiplicative constant) are preserved. To address the request for an explicit derivation, we will expand the proof of the main theorem to display the modified error recursion after bias correction and confirm that all summability conditions continue to hold. This expanded argument will appear in the revised main text or an appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: convergence proof invokes external MoM bounds and stochastic approximation

full rationale

The paper establishes almost sure convergence and rate for its MoM-based Nash-seeking recursion under the stated assumptions (finite δ-moment noise, monotonicity/Lipschitz pseudo-gradient). The derivation applies standard MoM concentration inequalities (cited as external robust-estimation results) inside a stochastic-approximation framework; the claimed a.s. statements are not obtained by redefining the target quantity in terms of itself, by fitting a parameter on a data subset and relabeling the fit as a prediction, or by load-bearing self-citation chains. The online dependence of successive gradients is handled by the step-size schedule and supermartingale arguments rather than by constructional equivalence to the input assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions typical of stochastic approximation for games plus the finite-moment condition on noise; no free parameters or new entities are introduced.

axioms (2)

domain assumption The underlying game satisfies standard regularity conditions such as monotonicity or strong convexity of the pseudo-gradient.
Invoked to guarantee existence and uniqueness of the Nash equilibrium and to support the convergence analysis.
domain assumption Gradient noise possesses a finite δ-th moment for some 1 < δ ≤ 2.
Central modeling assumption that enables the heavy-tailed setting while ruling out infinite variance.

pith-pipeline@v0.9.0 · 5519 in / 1257 out tokens · 53688 ms · 2026-05-07T13:31:28.115349+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references

[1]

Non-cooperative games,

J. Nash, “Non-cooperative games,”Annals of mathematics, pp. 286–295, 1951

1951
[2]

Mazalov,Mathematical game theory and applications

V . Mazalov,Mathematical game theory and applications. John Wiley & Sons, 2014

2014
[3]

Gibbons,Game theory for applied economists

R. Gibbons,Game theory for applied economists. Princeton University Press, 1992

1992
[4]

S. A. Mansouri, ´A. Paredes, J. M. Gonz´alez, and J. A. Aguado, “A three- layer game theoretic-based strategy for optimal scheduling of microgrids by leveraging a dynamic demand response program designer to unlock the potential of smart buildings and electric vehicle fleets,”Applied Energy, vol. 347, p. 121440, 2023

2023
[5]

Nash learning from human feedback,

R. Munos, M. Valko, D. Calandriello, M. G. Azar, M. Rowland, Z. D. Guo, Y . Tang, M. Geist, T. Mesnard, C. Fiegelet al., “Nash learning from human feedback,” inForty-first International Conference on Machine Learning, 2024

2024
[6]

Distributed nash equilibrium seeking by a consensus based approach,

M. Ye and G. Hu, “Distributed nash equilibrium seeking by a consensus based approach,”IEEE Transactions on Automatic Control, vol. 62, no. 9, pp. 4811–4818, 2017

2017
[7]

Nash equilibrium computation in subnetwork zero-sum games with switching communi- cations,

Y . Lou, Y . Hong, L. Xie, G. Shi, and K. H. Johansson, “Nash equilibrium computation in subnetwork zero-sum games with switching communi- cations,”IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 2920–2935, 2016

2016
[8]

Distributed nash equilibrium seeking for aggregative games with coupled constraints,

S. Liang, P. Yi, and Y . Hong, “Distributed nash equilibrium seeking for aggregative games with coupled constraints,”Automatica, vol. 85, pp. 179–185, 2017

2017
[9]

Continuous-time penalty methods for nash equilib- rium seeking of a nonsmooth generalized noncooperative game,

C. Sun and G. Hu, “Continuous-time penalty methods for nash equilib- rium seeking of a nonsmooth generalized noncooperative game,”IEEE Transactions on Automatic Control, vol. 66, no. 10, pp. 4895–4902, 2020

2020
[10]

Nash equilibrium seeking in non- cooperative games,

P. Frihauf, M. Krstic, and T. Basar, “Nash equilibrium seeking in non- cooperative games,”IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1192–1207, 2011

2011
[11]

Generalized nash equilib- rium seeking strategy for distributed nonsmooth multi-cluster game,

X. Zeng, J. Chen, S. Liang, and Y . Hong, “Generalized nash equilib- rium seeking strategy for distributed nonsmooth multi-cluster game,” Automatica, vol. 103, pp. 20–26, 2019

2019
[12]

Distributed nash equilibrium seeking with limited cost function knowledge via a consensus-based gradient-free method,

Y . Pang and G. Hu, “Distributed nash equilibrium seeking with limited cost function knowledge via a consensus-based gradient-free method,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1832–1839, 2020

2020
[13]

Continuous-time integral dynamics for a class of aggregative games with coupling constraints,

C. De Persis and S. Grammatico, “Continuous-time integral dynamics for a class of aggregative games with coupling constraints,”IEEE Transactions on Automatic Control, 2019

2019
[14]

Learning generalized nash equilibria in a class of convex games,

T. Tatarenko and M. Kamgarpour, “Learning generalized nash equilibria in a class of convex games,”IEEE Transactions on Automatic Control, vol. 64, no. 4, pp. 1426–1439, 2018

2018
[15]

An operator splitting approach for distributed generalized nash equilibria computation,

P. Yi and L. Pavel, “An operator splitting approach for distributed generalized nash equilibria computation,”Automatica, vol. 102, pp. 111– 121, 2019

2019
[16]

Distributed algorithms for searching gen- eralized nash equilibrium of noncooperative games,

K. Lu, G. Jing, and L. Wang, “Distributed algorithms for searching gen- eralized nash equilibrium of noncooperative games,”IEEE Transactions on Cybernetics, vol. 49, no. 6, pp. 2362–2371, 2018

2018
[17]

Stochastic approximation approaches to the stochastic variational inequality problem,

H. Jiang and H. Xu, “Stochastic approximation approaches to the stochastic variational inequality problem,”IEEE Transactions on Au- tomatic Control, vol. 53, no. 6, pp. 1462–1475, 2008

2008
[18]

Sample average approximation methods for a class of stochastic variational inequality problems,

H. Xu, “Sample average approximation methods for a class of stochastic variational inequality problems,”Asia-Pacific Journal of Operational Research, vol. 27, no. 01, pp. 103–119, 2010

2010
[19]

Regularized iterative stochastic approximation methods for stochastic variational inequality problems,

J. Koshal, A. Nedic, and U. V . Shanbhag, “Regularized iterative stochastic approximation methods for stochastic variational inequality problems,”IEEE Transactions on Automatic Control, vol. 58, no. 3, pp. 594–609, 2012

2012
[20]

Extragradient method with variance reduction for stochastic variational inequalities,

A. N. Iusem, A. Jofr ´e, R. I. Oliveira, and P. Thompson, “Extragradient method with variance reduction for stochastic variational inequalities,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 686–724, 2017

2017
[21]

New first-order algorithms for stochastic variational inequalities,

K. Huang and S. Zhang, “New first-order algorithms for stochastic variational inequalities,”SIAM Journal on Optimization, vol. 32, no. 4, pp. 2745–2772, 2022

2022
[22]

Stochastic generalized nash equilibrium- seeking in merely monotone games,

B. Franci and S. Grammatico, “Stochastic generalized nash equilibrium- seeking in merely monotone games,”IEEE Transactions on Automatic Control, vol. 67, no. 8, pp. 3905–3919, 2021

2021
[23]

Decentralized local stochastic extra-gradient for variational inequalities,

A. Beznosikov, P. Dvurechenskii, A. Koloskova, V . Samokhin, S. U. Stich, and A. Gasnikov, “Decentralized local stochastic extra-gradient for variational inequalities,”Advances in Neural Information Processing Systems, vol. 35, pp. 38 116–38 133, 2022

2022
[24]

Distributed variable sample-size gradient- response and best-response schemes for stochastic nash equilibrium problems,

J. Lei and U. V . Shanbhag, “Distributed variable sample-size gradient- response and best-response schemes for stochastic nash equilibrium problems,”SIAM Journal on Optimization, vol. 32, no. 2, pp. 573–603, 2022

2022
[25]

Distributed learning for stochastic generalized nash equilibrium problems,

C.-K. Yu, M. van der Schaar, and A. Sayed, “Distributed learning for stochastic generalized nash equilibrium problems,”IEEE Transactions on Signal Processing, 2017

2017
[26]

Distributed forward- backward (half) forward algorithms for generalized nash equilibrium seeking,

B. Franci, M. Staudigl, and S. Grammatico, “Distributed forward- backward (half) forward algorithms for generalized nash equilibrium seeking,” in2020 European Control Conference (ECC). IEEE, 2020, pp. 1274–1279

2020
[27]

Distributed primal-dual algorithms for stochastic generalized nash equilibrium seeking under full and partial-decision information,

L. Zheng, H. Li, L. Ran, L. Gao, and D. Xia, “Distributed primal-dual algorithms for stochastic generalized nash equilibrium seeking under full and partial-decision information,”IEEE Transactions on Control of Network Systems, 2022

2022
[28]

A tail-index analysis of stochastic gradient noise in deep neural networks,

U. Simsekli, L. Sagun, and M. Gurbuzbalaban, “A tail-index analysis of stochastic gradient noise in deep neural networks,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 5827–5837

2019
[29]

Why are adaptive methods good for attention models?

J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. Reddi, S. Kumar, and S. Sra, “Why are adaptive methods good for attention models?”Ad- vances in Neural Information Processing Systems, vol. 33, pp. 15 383– 15 393, 2020

2020
[30]

Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,

E. Gorbunov, M. Danilova, and A. Gasnikov, “Stochastic optimization with heavy-tailed noise via accelerated gradient clipping,”Advances in Neural Information Processing Systems, vol. 33, pp. 15 042–15 053, 2020

2020
[31]

Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact,

N. Kornilov, A. Gasnikov, P. Dvurechensky, and D. Dvinskikh, “Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact,”Computational Manage- ment Science, vol. 20, no. 1, p. 37, 2023

2023
[32]

Clipped stochastic methods for variational inequalities with heavy-tailed noise,

E. Gorbunov, M. Danilova, D. Dobre, P. Dvurechenskii, A. Gasnikov, and G. Gidel, “Clipped stochastic methods for variational inequalities with heavy-tailed noise,”Advances in Neural Information Processing Systems, vol. 35, pp. 31 319–31 332, 2022

2022
[33]

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance,

A. Sadiev, M. Danilova, E. Gorbunov, S. Horv ´ath, G. Gidel, P. Dvurechensky, A. Gasnikov, and P. Richt ´arik, “High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance,” inInternational conference on machine learning. PMLR, 2023, pp. 29 563–29 648

2023
[34]

Distributed stochastic nash equilibrium seeking under heavy-tailed noises,

C. Sun, B. Chen, J. Wang, Z. Wang, and L. Yu, “Distributed stochastic nash equilibrium seeking under heavy-tailed noises,”Automatica, vol. 173, p. 112081, 2025

2025
[35]

Real and complex monotone communication games,

G. Scutari, F. Facchinei, J.-S. Pang, and D. P. Palomar, “Real and complex monotone communication games,”IEEE Transactions on In- formation Theory, vol. 60, no. 7, pp. 4197–4231, 2014

2014
[36]

B. T. Polyak,Introduction to optimization. Optimization Software, Inc., New York, 1987

1987
[37]

Bandits with heavy tail,

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi, “Bandits with heavy tail,” IEEE Transactions on Information Theory, vol. 59, no. 11, pp. 7711– 7717, 2013. APPENDIX A. Proof of inequality(13)in Lemma 2 DefineW j =−Z j forj= 1,· · ·, m. Then,W 1,· · ·, W m are i.i.d. with mean ¯θW =− ¯θand satisfy the same moment condition E[|W1 − ¯θW |1+α]≤u.(28) Let ¯Wℓ := 1 s...

2013