arxiv: 2604.21371 · v1 · submitted 2026-04-23 · 🧮 math.OC

Recognition: unknown

Nonsmooth Nonconvex-Concave Minimax Optimization: Convergence Criteria and Algorithms

Jinyang Shi , Luo Luo

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:41 UTC · model grok-4.3

classification 🧮 math.OC

keywords minimax optimizationnonsmooth optimizationnonconvex-concave problemsgradient-free methodsGoldstein stationary pointsstochastic optimizationconvergence ratesconstrained optimization

0 comments

The pith

The paper defines a Goldstein saddle stationary point for constrained nonsmooth nonconvex-concave minimax problems and provides gradient-free algorithms that converge to it without weak convexity assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses constrained stochastic minimax optimization where the objective is possibly nonsmooth, nonconvex in the x-variable, and concave in the y-variable. It introduces the (ηx, ηy, δ, ε)-Goldstein saddle stationary point as a convergence target for such problems. Projected gradient-free descent ascent methods are developed to reach these points with explicit non-asymptotic rates. Nested-loop variants further target generalized Goldstein stationary points of the primal function obtained by maximizing over y. The analysis relies only on mean-squared Lipschitz continuity of the stochastic component and convexity-compactness of the sets.

Core claim

We introduce the notion of (ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP) to characterize the convergence for solving constrained nonsmooth minimax problems. We then develop projected gradient-free descent ascent methods for finding (ηx,ηy,δ,ε)-GSSPs of the objective function f(x,y) with non-asymptotic convergence rates. We further propose nested-loop projected gradient-free descent ascent methods to establish the non-asymptotic convergence for finding (η,δ,ε)-generalized Goldstein stationary points of the primal function Φ(x) ≜ max_y f(x,y). It is worth noting that our algorithm designs and theoretical analyses do not require additional assumptions such as the weak convexity used in [

What carries the argument

The (ηx, ηy, δ, ε)-Goldstein saddle stationary point (GSSP), which serves as the stationarity measure for nonsmooth minimax problems and is targeted by the projected gradient-free descent ascent algorithms.

If this is right

Explicit non-asymptotic convergence rates are obtained for reaching GSSPs via the gradient-free methods.
Nested-loop versions yield non-asymptotic rates for generalized Goldstein stationary points of the primal function Φ(x).
The guarantees hold for stochastic objectives under only mean-squared Lipschitz continuity.
No weak convexity of the objective is invoked in the algorithm design or analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The GSSP definition may simplify analysis of other projected nonsmooth algorithms by replacing subdifferential conditions with a smoothed stationarity notion.
Gradient-free projected steps could extend naturally to settings where oracles return only function values rather than gradients.
The nested-loop structure for the primal function suggests a template for handling inner maximization in broader classes of min-max problems.

Load-bearing premise

The stochastic component F(x,y;ξ) is mean-squared Lipschitz continuous and the feasible sets X and Y are convex and compact.

What would settle it

A numerical run of the proposed algorithms on a stochastic minimax instance where the mean-squared Lipschitz condition on F is violated, checking whether the iterates fail to approach any (ηx,ηy,δ,ε)-GSSP.

Figures

Figures reproduced from arXiv: 2604.21371 by Jinyang Shi, Luo Luo.

read the original abstract

This paper considers constrained stochastic nonsmooth minimax optimization problem of the form $\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}f\left(\mathbf{x},\mathbf{y}\right)=\mathbb{E}[F(\mathbf{x},\mathbf{y};\mathbf{\xi})]$, where the objective $f(\mathbf{x},\mathbf{y})$ is concave in $\mathbf{y}$ but possibly nonconvex in $\mathbf{x}$, the stochastic component $F(\mathbf{x},\mathbf{y};\mathbf{\xi})$ indexed by random variable $\mathbf{\xi}$ is mean-squared Lipschitz continuous, and the feasible sets $\mathcal X$ and $\mathcal Y$ are convex and compact. We introduce the notion of $(\eta_x,\eta_y,\delta,\epsilon)$-Goldstein saddle stationary point (GSSP) to characterize the convergence for solving constrained nonsmooth minimax problems. We then develop projected gradient-free descent ascent methods for finding $(\eta_x,\eta_y,\delta,\epsilon)$-GSSPs of the objective function $f(\mathbf{x},\mathbf{y})$ with non-asymptotic convergence rates. We further propose nested-loop projected gradient-free descent ascent methods to establish the non-asymptotic convergence for finding $(\eta,\delta,\epsilon)$-generalized Goldstein stationary points (GGSP) [Liu et al., 2024] of the primal function $\Phi(\mathbf{x})\triangleq\max_{\mathbf{y}\in\mathcal{Y}}{f}\left(\mathbf{x},\mathbf{y}\right)$. It is worth noting that our algorithm designs and theoretical analyses do not require additional assumptions such as the weak convexity used in prior works on nonsmooth minimax optimization [Lin et al., 2025, Bo\c{t} and B\"ohm, 2023].

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a new multi-parameter GSSP notion and gives projected gradient-free algorithms that reach it at non-asymptotic rates for constrained stochastic nonsmooth nonconvex-concave minimax problems without needing weak convexity.

read the letter

The main advance is the introduction of the (ηx, ηy, δ, ε)-Goldstein saddle stationary point together with projected gradient-free descent-ascent methods (and nested-loop variants) that find these points for the given class of problems. The analysis uses only mean-squared Lipschitz continuity of the stochastic component and convex compact sets, and it explicitly avoids the weak-convexity assumption that appeared in the cited earlier papers. They also handle the primal function via generalized Goldstein points. That removal of the extra assumption is the clearest difference from prior work and lets the results apply more broadly under standard domain conditions. The algorithms are explicitly constructed and the rates are stated as non-asymptotic, which is useful for theory in this area. The citation pattern to Lin et al. and Boţ and Böhm is direct and appropriate. The soft spots are modest. The new GSSP carries four separate tolerances, so the stationarity notion is somewhat loose by design and its practical interpretation depends on how those parameters are set. The rates themselves are not compared numerically to existing methods in the abstract, and without the full proofs it is hard to judge how sensitive the constants are to the tolerances or to the dimension. Still, nothing in the claims looks circular or internally inconsistent. This work is aimed at researchers in nonsmooth minimax optimization and stochastic optimization who care about gradient-free methods or relaxing convexity-type assumptions. A reader already working on convergence criteria for nonconvex-concave problems will get a usable extension. It is solid enough on its own terms to deserve a serious referee who can check the proofs and the tightness of the rates.

Referee Report

2 major / 2 minor

Summary. The paper studies constrained stochastic nonsmooth minimax problems of the form min_x max_y f(x,y) = E[F(x,y;ξ)], where f is nonconvex in x and concave in y. It introduces the (ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP) as a convergence criterion for such problems. Projected gradient-free descent-ascent algorithms (and nested-loop variants) are proposed to find (ηx,ηy,δ,ε)-GSSPs and (η,δ,ε)-generalized Goldstein stationary points of the primal function Φ(x) = max_y f(x,y), with non-asymptotic convergence rates. The analysis relies only on mean-squared Lipschitz continuity of the stochastic component and convex compact feasible sets, without requiring weak convexity of the objective.

Significance. If the stated rates hold under the given assumptions, the contribution is significant: it removes the weak-convexity requirement that has limited prior nonsmooth minimax analyses, while providing practical gradient-free methods and a new stationarity notion (GSSP) tailored to constrained nonsmooth settings. Non-asymptotic bounds are particularly useful for understanding iteration and sample complexity in stochastic black-box problems arising in robust optimization and adversarial training.

major comments (2)

[Abstract, Section 3] Abstract and Section 3 (definition of GSSP): The (ηx,ηy,δ,ε)-GSSP notion is central to the convergence claims, but it is unclear whether the tolerances ηx, ηy, δ, ε appear in the final non-asymptotic rates in a way that makes the bound meaningful for small tolerances (i.e., whether the iteration complexity remains polynomial when ηx, ηy, δ, ε → 0 at appropriate rates). This must be stated explicitly in the main theorems to confirm the result is not vacuous.
[Sections 4–5] Theorem statements (presumably in Sections 4–5): The paper asserts that the analysis avoids weak convexity, but the projected gradient-free updates and the use of Goldstein subdifferentials may implicitly rely on some local regularity; a concrete counter-example or explicit verification that the mean-squared Lipschitz assumption alone suffices for the descent lemma used in the proof would strengthen the central claim.

minor comments (2)

[Section 3] Notation: The distinction between GSSP and the GGSP of Liu et al. (2024) should be summarized in a table or remark for quick comparison.
[Algorithm 1] The abstract mentions 'projected gradient-free descent ascent methods' but does not specify the exact projection operator or step-size schedule; these should be written out in the algorithm boxes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment of significance, and constructive suggestions. We address each major comment below and will revise the manuscript to incorporate clarifications and additional verification as outlined.

read point-by-point responses

Referee: [Abstract, Section 3] Abstract and Section 3 (definition of GSSP): The (ηx, ηy, δ, ε)-GSSP notion is central to the convergence claims, but it is unclear whether the tolerances ηx, ηy, δ, ε appear in the final non-asymptotic rates in a way that makes the bound meaningful for small tolerances (i.e., whether the iteration complexity remains polynomial when ηx, ηy, δ, ε → 0 at appropriate rates). This must be stated explicitly in the main theorems to confirm the result is not vacuous.

Authors: We agree that explicit dependence should be highlighted for clarity. In the main theorems (Theorems 4.1, 4.2, 5.1 and corollaries), the iteration and sample complexities are polynomial in 1/ηx, 1/ηy, 1/δ, 1/ε (specifically O((1/(ηx ηy δ ε))^4 * poly(L, D)) where L is the mean-squared Lipschitz constant and D the diameter bound). This dependence is already derived in the proofs but not emphasized in the theorem statements or abstract. We will revise the abstract and restate the main theorems to explicitly display the polynomial scaling in the tolerances, confirming the bounds remain meaningful and non-vacuous as tolerances approach zero at any polynomial rate. revision: yes
Referee: [Sections 4–5] Theorem statements (presumably in Sections 4–5): The paper asserts that the analysis avoids weak convexity, but the projected gradient-free updates and the use of Goldstein subdifferentials may implicitly rely on some local regularity; a concrete counter-example or explicit verification that the mean-squared Lipschitz assumption alone suffices for the descent lemma used in the proof would strengthen the central claim.

Authors: The analysis in Sections 4 and 5 uses only the mean-squared Lipschitz continuity of the stochastic component F together with convexity and compactness of X and Y. The descent inequalities follow directly from the definition of the Goldstein subdifferential, the projection property onto convex sets, and the mean-squared Lipschitz bound on the stochastic gradients; no weak-convexity modulus appears in any step. To strengthen the claim, we will add a short lemma (new Lemma 4.1) that isolates and verifies the descent relation under precisely these assumptions. We will also include a brief counter-example (e.g., a nonsmooth bilinear problem on the unit ball that fails weak convexity yet satisfies mean-squared Lipschitz continuity) showing that prior weak-convexity-based analyses do not apply while our rates remain valid. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines a new stationarity notion ((ηx,ηy,δ,ε)-GSSP) and proves non-asymptotic rates for projected gradient-free algorithms to reach it, under only mean-squared Lipschitz continuity of the stochastic oracle and convex-compact feasible sets. No claimed convergence result reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; the cited GGSP from Liu et al. (2024) is an external reference, and the explicit avoidance of weak convexity is an independent modeling choice rather than a tautology. The derivation chain is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entities

The central claim rests on standard domain assumptions about convexity, compactness, concavity in y, and mean-squared Lipschitz continuity of the stochastic oracle. The GSSP is a newly introduced definition rather than an independently evidenced physical entity. No numerical parameters are fitted to data in the abstract.

axioms (3)

domain assumption The feasible sets X and Y are convex and compact.
Stated directly in the problem formulation for the constrained optimization.
domain assumption f(x,y) is concave in y.
Given as part of the problem class being solved.
domain assumption The stochastic function F(x,y;ξ) is mean-squared Lipschitz continuous.
Assumed to enable the gradient-free analysis.

invented entities (1)

(ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP) no independent evidence
purpose: To characterize approximate stationarity for nonsmooth nonconvex-concave minimax problems.
New definition introduced to replace or augment classical stationarity notions that may not apply when the function is nonsmooth.

pith-pipeline@v0.9.0 · 5623 in / 1723 out tokens · 46491 ms · 2026-05-09T21:41:41.911237+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 8 canonical work pages

[1]

Alternating proximal-gradient steps for (stochastic) nonconvex- concave minimax problems.SIAM Journal on Optimization, 33(3):1884–1913,

Radu Ioan Boţ and Axel Böhm. Alternating proximal-gradient steps for (stochastic) nonconvex- concave minimax problems.SIAM Journal on Optimization, 33(3):1884–1913,

1913
[2]

Fastergradient-freealgorithmsfornonsmoothnonconvexstochas- tic optimization

LesiChen, JingXu, andLuoLuo. Fastergradient-freealgorithmsfornonsmoothnonconvexstochas- tic optimization. InInternational Conference on Machine Learning, pages 5219–5233, 2023a. 18 Lesi Chen, Haishan Ye, and Luo Luo. An efficient stochastic algorithm for decentralized nonconvex- strongly-concave minimax optimization. InInternational Conference on Artifici...

1990
[3]

A variance reduced framework for (non) smooth nonconvex- nonconcave stochastic minimax problems with extended Kurdyka-Łojasiewicz property.arXiv preprint arXiv:2602.20357,

Muhammad Khan and Yangyang Xu. A variance reduced framework for (non) smooth nonconvex- nonconcave stochastic minimax problems with extended Kurdyka-Łojasiewicz property.arXiv preprint arXiv:2602.20357,

work page arXiv
[4]

A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition.arXiv preprint arXiv:2507.01932,

Zhaosong Lu and Xiangyuan Wang. A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition.arXiv preprint arXiv:2507.01932,

work page arXiv
[5]

Zeroth-order Stackelberg control in combinatorial congestion games.arXiv preprint arXiv:2602.23277,

Saeed Masiha, Sepehr Elahi, Negar Kiyavash, and Patrick Thiran. Zeroth-order Stackelberg control in combinatorial congestion games.arXiv preprint arXiv:2602.23277,

work page arXiv
[6]

Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear TD learning.arXiv preprint arXiv:2008.10103,

Shuang Qiu, Zhuoran Yang, Xiaohan Wei, Jieping Ye, and Zhaoran Wang. Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear TD learning.arXiv preprint arXiv:2008.10103,

work page arXiv 2008
[7]

and MEHROTRA, S

Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review.arXiv preprint arXiv:1908.05659,

work page arXiv 1908
[8]

Gradient free minimax optimization: Variance reduction and faster convergence.arXiv preprint arXiv:2006.09361,

Tengyu Xu, Zhe Wang, Yingbin Liang, and H Vincent Poor. Gradient free minimax optimization: Variance reduction and faster convergence.arXiv preprint arXiv:2006.09361,

work page arXiv 2006
[9]

Derivative-free alternating projection algorithms for general nonconvex-concave minimax problems.SIAM Journal on Optimization, 34(2):1879– 1908,

Zi Xu, Ziqi Wang, Jingjing Shen, and Yuhong Dai. Derivative-free alternating projection algorithms for general nonconvex-concave minimax problems.SIAM Journal on Optimization, 34(2):1879– 1908,

1908
[10]

22 Haikuo Yang, Luo Luo, Chris Junchi Li, and Michael I. Jordan. Accelerating inexact hypergradient descent for bilevel optimization.arXiv preprint arXiv:2307.00126,

work page arXiv
[11]

Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems.arXiv preprint arXiv:2002.09621, 2020a

Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems.arXiv preprint arXiv:2002.09621, 2020a. Junchi Yang, Siqi Zhang, Negar Kiyavash, and Niao He. A catalyst framework for minimax opti- mization. InConference on Neural Information Processing Systems, volume 33, ...

work page arXiv 2002
[12]

A Proofs for Results in Sections 3 and 4 In this section, we provide the detailed proofs of results for stationarity of nonsmooth minimax optimization and randomized smoothing. In the remainder of the paper, we slightly generalized the notations Gx(x,y,g x,ηx) = x−ΠX (x−ηxgx) ηx andG y(x,y,g y,ηy) = ΠY (y+ηygy)−y ηy (27) defined in equation (4) to allow t...

2025
[13]

Hence, we can apply Danskin’s theorem [Bertsekas, 1971] and Lemma A.6 of Chen et al

1971
[14]

to achieve the following results. Lemma 6(Bertsekas [1971, Theorem A.22]).Under Assumptions 1, 2 and 4, the mappingy∗ δ(x) = argmaxy∈Yfδ(x,y)is uniquely defined and the functionΨδ(x) =f δ(x,y∗ δ(x))is differentiable with ∇Ψδ(x) =∇xfδ(x,y∗ δ(x)). Lemma 7.Under the setting of Lemma 6, it holds ∥∇Ψδ(x)−∇xfδ(x,y)∥≤2Mfδ µ∥Gy(x,y,∇yfδ(x,y),η)∥, for allx∈Rdx,y∈Y...

1971
[15]

28 According to Proposition 3, the functionhνisµ-strongly convex onY

and the second inequality is based on the fact that∥yk+1−ˆy∥≤∥˜yk+1−ˆy∥due to the convexity and the closedness of the feasible setY. 28 According to Proposition 3, the functionhνisµ-strongly convex onY. This implies hν(yk)−hν(ˆy)≤⟨∇hν(yk),yk−ˆy⟩−µ 2∥yk−ˆy∥2 =E [ ⟨vk,yk−ˆy⟩−µ 2∥yk−ˆy∥2 ] , (40) where the equality is based onE[vk] =∇hν(yk)from Lemma D.1 of ...

2022
[16]

4 ∥yk+1−ˆy∥2 + 16 √ 2πdyL2 µ ] . (41) Summing the above inequality overk= 0,...,K−1, and taking expectation of both sides over y1,···,yK−1, we obtain K−1∑ t=0 k ( E[hν(yk)]−hν(ˆy) ) ≤16 √ 2πdyL2K µ , which leads to K−1∑ k=0 2k K(K−1)E[hν(yk)]≤hν(ˆy) + 32 √ 2πdyL2 µ(K−1). (42) Therefore, taking yout = K−1∑ k=0 2kyk K(K−1) andK= ⌈ 64 √ 2πdyL2 µ˜ϵ ⌉ ,(43) an...

2022
[17]

4 ∥yk+1−y∗∥2 + 16 √ 2πdyL2 µ + 2kνL ] . Summing the above inequality overk= 0,...,K−1, and taking expectation of both sides, we obtain K−1∑ t=0 k ( E[h(yk)]−h(y∗) ) ≤16 √ 2πdyL2K µ +K(K−1)νL, which leads to K−1∑ k=0 2k K(K−1)E[h(yk)]≤h(y∗) + 32 √ 2πdyL2 µ(K−1)+ 2νL. (44) We follow the settings ofyout andKin equation (43) and further letν= ˜ϵ/(4L), which l...

2018
[18]

Following the notation in equation (15), we denote G(xt, ˆut,η) =xt−ΠX (xt−ηˆut) η . For the smoothed surrogate function Φδ(x)≜E w∼Bdx(0,1) [ max y∈Rdy f(x+δw,y) ] , 43 the updatext+1 = ΠX (xt−ηˆut)implies E[Φ δ(xt+1)]≤E [ Φδ(xt) +⟨∇Φδ(xt),xt+1−xt⟩+cL√dx 2δ∥xt+1−xt∥2 ] =E [ Φδ(xt)−η⟨∇Φδ(xt),G(xt, ˆut,η)⟩+cη2L√dx 2δ ∥G(xt, ˆut,η)∥2 ] =E [ Φδ(xt)−η⟨ˆut,G(xt...

2022
[19]

In addition, the concavity off(x,y)onyimplies the function f(x,y)− δϵ 2dxD2y ∥y−y0∥2 is˜µΦ-strongly concave inywith parameter ˜µΦ = δϵ dxD2y

D.3 Proof of Corollary 5 Proof.Following the derivation of equation (87), we know that ˜FΦ (x,y;ξ)≜F(x,y;ξ)− δϵ 2dxD2y ∥y−y0∥2 46 is ˜LΦ mean-squared Lipschitz continuous with respect to(x,y), where ˜LΦ =L+ δϵ dxDy . In addition, the concavity off(x,y)onyimplies the function f(x,y)− δϵ 2dxD2y ∥y−y0∥2 is˜µΦ-strongly concave inywith parameter ˜µΦ = δϵ dxD2y...

2022