Recognition: unknown
Nonsmooth Nonconvex-Concave Minimax Optimization: Convergence Criteria and Algorithms
Pith reviewed 2026-05-09 21:41 UTC · model grok-4.3
The pith
The paper defines a Goldstein saddle stationary point for constrained nonsmooth nonconvex-concave minimax problems and provides gradient-free algorithms that converge to it without weak convexity assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the notion of (ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP) to characterize the convergence for solving constrained nonsmooth minimax problems. We then develop projected gradient-free descent ascent methods for finding (ηx,ηy,δ,ε)-GSSPs of the objective function f(x,y) with non-asymptotic convergence rates. We further propose nested-loop projected gradient-free descent ascent methods to establish the non-asymptotic convergence for finding (η,δ,ε)-generalized Goldstein stationary points of the primal function Φ(x) ≜ max_y f(x,y). It is worth noting that our algorithm designs and theoretical analyses do not require additional assumptions such as the weak convexity used in [
What carries the argument
The (ηx, ηy, δ, ε)-Goldstein saddle stationary point (GSSP), which serves as the stationarity measure for nonsmooth minimax problems and is targeted by the projected gradient-free descent ascent algorithms.
If this is right
- Explicit non-asymptotic convergence rates are obtained for reaching GSSPs via the gradient-free methods.
- Nested-loop versions yield non-asymptotic rates for generalized Goldstein stationary points of the primal function Φ(x).
- The guarantees hold for stochastic objectives under only mean-squared Lipschitz continuity.
- No weak convexity of the objective is invoked in the algorithm design or analysis.
Where Pith is reading between the lines
- The GSSP definition may simplify analysis of other projected nonsmooth algorithms by replacing subdifferential conditions with a smoothed stationarity notion.
- Gradient-free projected steps could extend naturally to settings where oracles return only function values rather than gradients.
- The nested-loop structure for the primal function suggests a template for handling inner maximization in broader classes of min-max problems.
Load-bearing premise
The stochastic component F(x,y;ξ) is mean-squared Lipschitz continuous and the feasible sets X and Y are convex and compact.
What would settle it
A numerical run of the proposed algorithms on a stochastic minimax instance where the mean-squared Lipschitz condition on F is violated, checking whether the iterates fail to approach any (ηx,ηy,δ,ε)-GSSP.
Figures
read the original abstract
This paper considers constrained stochastic nonsmooth minimax optimization problem of the form $\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}f\left(\mathbf{x},\mathbf{y}\right)=\mathbb{E}[F(\mathbf{x},\mathbf{y};\mathbf{\xi})]$, where the objective $f(\mathbf{x},\mathbf{y})$ is concave in $\mathbf{y}$ but possibly nonconvex in $\mathbf{x}$, the stochastic component $F(\mathbf{x},\mathbf{y};\mathbf{\xi})$ indexed by random variable $\mathbf{\xi}$ is mean-squared Lipschitz continuous, and the feasible sets $\mathcal X$ and $\mathcal Y$ are convex and compact. We introduce the notion of $(\eta_x,\eta_y,\delta,\epsilon)$-Goldstein saddle stationary point (GSSP) to characterize the convergence for solving constrained nonsmooth minimax problems. We then develop projected gradient-free descent ascent methods for finding $(\eta_x,\eta_y,\delta,\epsilon)$-GSSPs of the objective function $f(\mathbf{x},\mathbf{y})$ with non-asymptotic convergence rates. We further propose nested-loop projected gradient-free descent ascent methods to establish the non-asymptotic convergence for finding $(\eta,\delta,\epsilon)$-generalized Goldstein stationary points (GGSP) [Liu et al., 2024] of the primal function $\Phi(\mathbf{x})\triangleq\max_{\mathbf{y}\in\mathcal{Y}}{f}\left(\mathbf{x},\mathbf{y}\right)$. It is worth noting that our algorithm designs and theoretical analyses do not require additional assumptions such as the weak convexity used in prior works on nonsmooth minimax optimization [Lin et al., 2025, Bo\c{t} and B\"ohm, 2023].
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies constrained stochastic nonsmooth minimax problems of the form min_x max_y f(x,y) = E[F(x,y;ξ)], where f is nonconvex in x and concave in y. It introduces the (ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP) as a convergence criterion for such problems. Projected gradient-free descent-ascent algorithms (and nested-loop variants) are proposed to find (ηx,ηy,δ,ε)-GSSPs and (η,δ,ε)-generalized Goldstein stationary points of the primal function Φ(x) = max_y f(x,y), with non-asymptotic convergence rates. The analysis relies only on mean-squared Lipschitz continuity of the stochastic component and convex compact feasible sets, without requiring weak convexity of the objective.
Significance. If the stated rates hold under the given assumptions, the contribution is significant: it removes the weak-convexity requirement that has limited prior nonsmooth minimax analyses, while providing practical gradient-free methods and a new stationarity notion (GSSP) tailored to constrained nonsmooth settings. Non-asymptotic bounds are particularly useful for understanding iteration and sample complexity in stochastic black-box problems arising in robust optimization and adversarial training.
major comments (2)
- [Abstract, Section 3] Abstract and Section 3 (definition of GSSP): The (ηx,ηy,δ,ε)-GSSP notion is central to the convergence claims, but it is unclear whether the tolerances ηx, ηy, δ, ε appear in the final non-asymptotic rates in a way that makes the bound meaningful for small tolerances (i.e., whether the iteration complexity remains polynomial when ηx, ηy, δ, ε → 0 at appropriate rates). This must be stated explicitly in the main theorems to confirm the result is not vacuous.
- [Sections 4–5] Theorem statements (presumably in Sections 4–5): The paper asserts that the analysis avoids weak convexity, but the projected gradient-free updates and the use of Goldstein subdifferentials may implicitly rely on some local regularity; a concrete counter-example or explicit verification that the mean-squared Lipschitz assumption alone suffices for the descent lemma used in the proof would strengthen the central claim.
minor comments (2)
- [Section 3] Notation: The distinction between GSSP and the GGSP of Liu et al. (2024) should be summarized in a table or remark for quick comparison.
- [Algorithm 1] The abstract mentions 'projected gradient-free descent ascent methods' but does not specify the exact projection operator or step-size schedule; these should be written out in the algorithm boxes.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment of significance, and constructive suggestions. We address each major comment below and will revise the manuscript to incorporate clarifications and additional verification as outlined.
read point-by-point responses
-
Referee: [Abstract, Section 3] Abstract and Section 3 (definition of GSSP): The (ηx, ηy, δ, ε)-GSSP notion is central to the convergence claims, but it is unclear whether the tolerances ηx, ηy, δ, ε appear in the final non-asymptotic rates in a way that makes the bound meaningful for small tolerances (i.e., whether the iteration complexity remains polynomial when ηx, ηy, δ, ε → 0 at appropriate rates). This must be stated explicitly in the main theorems to confirm the result is not vacuous.
Authors: We agree that explicit dependence should be highlighted for clarity. In the main theorems (Theorems 4.1, 4.2, 5.1 and corollaries), the iteration and sample complexities are polynomial in 1/ηx, 1/ηy, 1/δ, 1/ε (specifically O((1/(ηx ηy δ ε))^4 * poly(L, D)) where L is the mean-squared Lipschitz constant and D the diameter bound). This dependence is already derived in the proofs but not emphasized in the theorem statements or abstract. We will revise the abstract and restate the main theorems to explicitly display the polynomial scaling in the tolerances, confirming the bounds remain meaningful and non-vacuous as tolerances approach zero at any polynomial rate. revision: yes
-
Referee: [Sections 4–5] Theorem statements (presumably in Sections 4–5): The paper asserts that the analysis avoids weak convexity, but the projected gradient-free updates and the use of Goldstein subdifferentials may implicitly rely on some local regularity; a concrete counter-example or explicit verification that the mean-squared Lipschitz assumption alone suffices for the descent lemma used in the proof would strengthen the central claim.
Authors: The analysis in Sections 4 and 5 uses only the mean-squared Lipschitz continuity of the stochastic component F together with convexity and compactness of X and Y. The descent inequalities follow directly from the definition of the Goldstein subdifferential, the projection property onto convex sets, and the mean-squared Lipschitz bound on the stochastic gradients; no weak-convexity modulus appears in any step. To strengthen the claim, we will add a short lemma (new Lemma 4.1) that isolates and verifies the descent relation under precisely these assumptions. We will also include a brief counter-example (e.g., a nonsmooth bilinear problem on the unit ball that fails weak convexity yet satisfies mean-squared Lipschitz continuity) showing that prior weak-convexity-based analyses do not apply while our rates remain valid. revision: yes
Circularity Check
No significant circularity
full rationale
The paper defines a new stationarity notion ((ηx,ηy,δ,ε)-GSSP) and proves non-asymptotic rates for projected gradient-free algorithms to reach it, under only mean-squared Lipschitz continuity of the stochastic oracle and convex-compact feasible sets. No claimed convergence result reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation; the cited GGSP from Liu et al. (2024) is an external reference, and the explicit avoidance of weak convexity is an independent modeling choice rather than a tautology. The derivation chain is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption The feasible sets X and Y are convex and compact.
- domain assumption f(x,y) is concave in y.
- domain assumption The stochastic function F(x,y;ξ) is mean-squared Lipschitz continuous.
invented entities (1)
-
(ηx,ηy,δ,ε)-Goldstein saddle stationary point (GSSP)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Alternating proximal-gradient steps for (stochastic) nonconvex- concave minimax problems.SIAM Journal on Optimization, 33(3):1884–1913,
Radu Ioan Boţ and Axel Böhm. Alternating proximal-gradient steps for (stochastic) nonconvex- concave minimax problems.SIAM Journal on Optimization, 33(3):1884–1913,
1913
-
[2]
Fastergradient-freealgorithmsfornonsmoothnonconvexstochas- tic optimization
LesiChen, JingXu, andLuoLuo. Fastergradient-freealgorithmsfornonsmoothnonconvexstochas- tic optimization. InInternational Conference on Machine Learning, pages 5219–5233, 2023a. 18 Lesi Chen, Haishan Ye, and Luo Luo. An efficient stochastic algorithm for decentralized nonconvex- strongly-concave minimax optimization. InInternational Conference on Artifici...
1990
-
[3]
Muhammad Khan and Yangyang Xu. A variance reduced framework for (non) smooth nonconvex- nonconcave stochastic minimax problems with extended Kurdyka-Łojasiewicz property.arXiv preprint arXiv:2602.20357,
-
[4]
Zhaosong Lu and Xiangyuan Wang. A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition.arXiv preprint arXiv:2507.01932,
-
[5]
Zeroth-order Stackelberg control in combinatorial congestion games.arXiv preprint arXiv:2602.23277,
Saeed Masiha, Sepehr Elahi, Negar Kiyavash, and Patrick Thiran. Zeroth-order Stackelberg control in combinatorial congestion games.arXiv preprint arXiv:2602.23277,
-
[6]
Shuang Qiu, Zhuoran Yang, Xiaohan Wei, Jieping Ye, and Zhaoran Wang. Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear TD learning.arXiv preprint arXiv:2008.10103,
-
[7]
Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review.arXiv preprint arXiv:1908.05659,
-
[8]
Tengyu Xu, Zhe Wang, Yingbin Liang, and H Vincent Poor. Gradient free minimax optimization: Variance reduction and faster convergence.arXiv preprint arXiv:2006.09361,
-
[9]
Derivative-free alternating projection algorithms for general nonconvex-concave minimax problems.SIAM Journal on Optimization, 34(2):1879– 1908,
Zi Xu, Ziqi Wang, Jingjing Shen, and Yuhong Dai. Derivative-free alternating projection algorithms for general nonconvex-concave minimax problems.SIAM Journal on Optimization, 34(2):1879– 1908,
1908
- [10]
-
[11]
Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems.arXiv preprint arXiv:2002.09621, 2020a. Junchi Yang, Siqi Zhang, Negar Kiyavash, and Niao He. A catalyst framework for minimax opti- mization. InConference on Neural Information Processing Systems, volume 33, ...
-
[12]
A Proofs for Results in Sections 3 and 4 In this section, we provide the detailed proofs of results for stationarity of nonsmooth minimax optimization and randomized smoothing. In the remainder of the paper, we slightly generalized the notations Gx(x,y,g x,ηx) = x−ΠX (x−ηxgx) ηx andG y(x,y,g y,ηy) = ΠY (y+ηygy)−y ηy (27) defined in equation (4) to allow t...
2025
-
[13]
Hence, we can apply Danskin’s theorem [Bertsekas, 1971] and Lemma A.6 of Chen et al
1971
-
[14]
to achieve the following results. Lemma 6(Bertsekas [1971, Theorem A.22]).Under Assumptions 1, 2 and 4, the mappingy∗ δ(x) = argmaxy∈Yfδ(x,y)is uniquely defined and the functionΨδ(x) =f δ(x,y∗ δ(x))is differentiable with ∇Ψδ(x) =∇xfδ(x,y∗ δ(x)). Lemma 7.Under the setting of Lemma 6, it holds ∥∇Ψδ(x)−∇xfδ(x,y)∥≤2Mfδ µ∥Gy(x,y,∇yfδ(x,y),η)∥, for allx∈Rdx,y∈Y...
1971
-
[15]
28 According to Proposition 3, the functionhνisµ-strongly convex onY
and the second inequality is based on the fact that∥yk+1−ˆy∥≤∥˜yk+1−ˆy∥due to the convexity and the closedness of the feasible setY. 28 According to Proposition 3, the functionhνisµ-strongly convex onY. This implies hν(yk)−hν(ˆy)≤⟨∇hν(yk),yk−ˆy⟩−µ 2∥yk−ˆy∥2 =E [ ⟨vk,yk−ˆy⟩−µ 2∥yk−ˆy∥2 ] , (40) where the equality is based onE[vk] =∇hν(yk)from Lemma D.1 of ...
2022
-
[16]
4 ∥yk+1−ˆy∥2 + 16 √ 2πdyL2 µ ] . (41) Summing the above inequality overk= 0,...,K−1, and taking expectation of both sides over y1,···,yK−1, we obtain K−1∑ t=0 k ( E[hν(yk)]−hν(ˆy) ) ≤16 √ 2πdyL2K µ , which leads to K−1∑ k=0 2k K(K−1)E[hν(yk)]≤hν(ˆy) + 32 √ 2πdyL2 µ(K−1). (42) Therefore, taking yout = K−1∑ k=0 2kyk K(K−1) andK= ⌈ 64 √ 2πdyL2 µ˜ϵ ⌉ ,(43) an...
2022
-
[17]
4 ∥yk+1−y∗∥2 + 16 √ 2πdyL2 µ + 2kνL ] . Summing the above inequality overk= 0,...,K−1, and taking expectation of both sides, we obtain K−1∑ t=0 k ( E[h(yk)]−h(y∗) ) ≤16 √ 2πdyL2K µ +K(K−1)νL, which leads to K−1∑ k=0 2k K(K−1)E[h(yk)]≤h(y∗) + 32 √ 2πdyL2 µ(K−1)+ 2νL. (44) We follow the settings ofyout andKin equation (43) and further letν= ˜ϵ/(4L), which l...
2018
-
[18]
Following the notation in equation (15), we denote G(xt, ˆut,η) =xt−ΠX (xt−ηˆut) η . For the smoothed surrogate function Φδ(x)≜E w∼Bdx(0,1) [ max y∈Rdy f(x+δw,y) ] , 43 the updatext+1 = ΠX (xt−ηˆut)implies E[Φ δ(xt+1)]≤E [ Φδ(xt) +⟨∇Φδ(xt),xt+1−xt⟩+cL√dx 2δ∥xt+1−xt∥2 ] =E [ Φδ(xt)−η⟨∇Φδ(xt),G(xt, ˆut,η)⟩+cη2L√dx 2δ ∥G(xt, ˆut,η)∥2 ] =E [ Φδ(xt)−η⟨ˆut,G(xt...
2022
-
[19]
In addition, the concavity off(x,y)onyimplies the function f(x,y)− δϵ 2dxD2y ∥y−y0∥2 is˜µΦ-strongly concave inywith parameter ˜µΦ = δϵ dxD2y
D.3 Proof of Corollary 5 Proof.Following the derivation of equation (87), we know that ˜FΦ (x,y;ξ)≜F(x,y;ξ)− δϵ 2dxD2y ∥y−y0∥2 46 is ˜LΦ mean-squared Lipschitz continuous with respect to(x,y), where ˜LΦ =L+ δϵ dxDy . In addition, the concavity off(x,y)onyimplies the function f(x,y)− δϵ 2dxD2y ∥y−y0∥2 is˜µΦ-strongly concave inywith parameter ˜µΦ = δϵ dxD2y...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.