On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis

Jing Xu; Jingzhao Zhang; Lesi Chen

arxiv: 2301.00712 · v9 · submitted 2023-01-02 · 🧮 math.OC · cs.AI· cs.LG

On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis

Lesi Chen , Jing Xu , Jingzhao Zhang This is my paper

Pith reviewed 2026-05-24 10:14 UTC · model grok-4.3

classification 🧮 math.OC cs.AIcs.LG

keywords bilevel optimizationhyper-gradientPolyak-Lojasiewicz conditioncomplexity boundsnonconvex optimizationstochastic optimizationhyperparameter tuning

0 comments

The pith

A first-order method finds small hyper-gradients in nonconvex bilevel problems under the PL condition with near-optimal rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first proves that locating stationary points of the hyper-objective is intractable for zero-respecting algorithms in the nonconvex-convex bilevel setting. It then shifts to nonconvex-nonconvex bilevel problems in which the lower-level objective obeys the Polyak-Lojasiewicz condition. In this regime a basic first-order procedure attains complexity bounds of order epsilon to the minus two in the deterministic case, minus four when gradients are partially stochastic, and minus six when fully stochastic. The first two bounds match known lower bounds up to logarithmic factors. These results supply concrete iteration counts for bilevel tasks such as hyperparameter tuning when the inner problem meets the stated structural assumption.

Core claim

In nonconvex-convex bilevel optimization, finding an epsilon-stationary point of the hyper-objective is intractable for zero-respecting algorithms. When the lower-level function instead satisfies the Polyak-Lojasiewicz condition, a simple first-order algorithm reaches an epsilon-stationary hyper-gradient with tilde-O(epsilon^{-2}) complexity in the deterministic setting, tilde-O(epsilon^{-4}) in the partially stochastic setting, and tilde-O(epsilon^{-6}) in the fully stochastic setting; the first two rates are optimal up to logarithmic factors.

What carries the argument

The Polyak-Lojasiewicz condition imposed on the lower-level objective, which replaces strong convexity and enables the improved first-order complexity analysis for the hyper-objective.

If this is right

Deterministic setting requires only tilde-O(epsilon^{-2}) iterations.
Partially stochastic setting requires tilde-O(epsilon^{-4}) iterations and matches the lower bound up to logs.
Fully stochastic setting requires tilde-O(epsilon^{-6}) iterations.
The same algorithm and analysis apply directly to nonconvex-nonconvex bilevel problems whose lower level obeys PL.
Hyperparameter tuning, architecture search, and meta-learning become tractable under the PL assumption on the inner problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If real inner problems approximately satisfy PL, the same first-order procedure should remain practical even when exact PL fails.
The hardness result indicates that algorithms outside the zero-respecting class will be required for general nonconvex-convex bilevel problems.
The gap between the fully stochastic rate and the lower bound suggests room for tighter analysis or variance-reduction techniques.
The results supply explicit iteration budgets that can be used to compare bilevel solvers against single-level baselines on concrete tasks.

Load-bearing premise

The lower-level function satisfies the Polyak-Lojasiewicz condition.

What would settle it

An explicit nonconvex-convex bilevel instance on which a zero-respecting algorithm finds an epsilon-stationary hyper-gradient in polynomial time, or a PL-satisfying instance on which the stated first-order algorithm requires more than the claimed number of iterations.

read the original abstract

Bilevel optimization reveals the inner structure of otherwise oblique optimization problems, such as hyperparameter tuning, neural architecture search, and meta-learning. A common goal in bilevel optimization is to minimize a hyper-objective that implicitly depends on the solution set of the lower-level function. Although this hyper-objective approach is widely used, its theoretical properties have not been thoroughly investigated in cases where the lower-level functions lack strong convexity. In this work, we first provide hardness results to show that the goal of finding stationary points of the hyper-objective for nonconvex-convex bilevel optimization can be intractable for zero-respecting algorithms. Then we study a class of tractable nonconvex-nonconvex bilevel problems when the lower-level function satisfies the Polyak-{\L}ojasiewicz (PL) condition. We show a simple first-order algorithm can achieve better complexity bounds of $\tilde{\mathcal{O}}(\epsilon^{-2})$, $\tilde{\mathcal{O}}(\epsilon^{-4})$ and $\tilde{\mathcal{O}}(\epsilon^{-6})$ in the deterministic, partially stochastic, and fully stochastic setting respectively. The complexities in the first two cases are optimal up to logarithmic factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hardness for zero-respecting algorithms on nonconvex-convex bilevel plus improved first-order rates under PL, with optimality up to logs in the deterministic and partial-stochastic cases.

read the letter

The main takeaway is that the paper shows finding stationary points of the hyper-objective is hard for zero-respecting algorithms in the nonconvex-convex setting, then gives a simple first-order algorithm with rates of tilde O(eps^{-2}), tilde O(eps^{-4}), and tilde O(eps^{-6}) when the lower level satisfies PL, claiming optimality up to logs in the first two regimes. The hardness result is positioned as new for that algorithm class, and the positive results improve on prior bounds under the stated assumption. The work is clear about conditioning everything on PL for the tractable case and separates the hardness claim from the algorithmic part. The analysis appears to follow standard stochastic optimization techniques for the upper bounds. The PL assumption is restrictive for genuinely nonconvex lower levels, which limits how far the positive rates extend, and the fully stochastic bound is not claimed to be optimal. The hardness applies specifically to zero-respecting methods, which is a natural restriction but leaves open whether other algorithm families could do better. No obvious circularity or unfalsifiable claims show up in the framing. This is a paper for researchers already working on bilevel optimization theory, hyperparameter tuning analysis, or stochastic nonconvex optimization. Someone building or analyzing algorithms for meta-learning or architecture search would find the complexity statements useful if they already accept PL-type conditions. It has enough concrete hardness and rate results to merit a serious referee rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The manuscript establishes hardness results showing that finding stationary points of the hyper-objective in nonconvex-convex bilevel optimization is intractable for zero-respecting algorithms. It then considers nonconvex-nonconvex bilevel problems where the lower-level function satisfies the Polyak-Łojasiewicz condition and presents a simple first-order algorithm with improved complexity bounds: tilde O(ε^{-2}) deterministic, tilde O(ε^{-4}) partially stochastic, tilde O(ε^{-6}) fully stochastic, claiming optimality up to logarithmic factors for the first two cases.

Significance. If the results hold, this work clarifies the boundary between tractable and intractable regimes in bilevel optimization and supplies improved first-order rates under the PL assumption, a standard condition that renders many nonconvex problems tractable. The matching upper/lower bounds (up to logs) claimed for the deterministic and partially stochastic settings would constitute a solid theoretical contribution if the lower-bound constructions are tight.

minor comments (3)

[Abstract] Abstract: the phrase 'zero-respecting algorithms' appears without a parenthetical gloss or forward reference to its definition; a one-sentence clarification would aid readers who encounter the term for the first time.
[Abstract] Abstract: the notation tilde O is used for all three regimes; explicitly stating what logarithmic factors are absorbed (or citing the precise theorem where they are defined) would remove ambiguity about the optimality claims.
The PL assumption is stated as sufficient for tractability, but the manuscript should confirm whether the algorithm analysis requires any additional regularity (e.g., smoothness of the hyper-objective or bounded variance) beyond what is listed in the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive evaluation of our work. We are pleased that the significance of the hardness results for zero-respecting algorithms in nonconvex-convex bilevel optimization and the improved complexity bounds under the PL condition were recognized. Since the report contains no specific major comments requiring clarification or correction, we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper establishes hardness for nonconvex-convex bilevel problems under zero-respecting algorithms and derives complexity bounds for a tractable nonconvex-nonconvex subclass under the explicitly stated Polyak-Lojasiewicz assumption on the lower-level objective. These bounds are obtained via standard first-order analysis whose steps are independent of the target rates; the PL condition is introduced as an external assumption that renders the problem tractable rather than being derived from or equivalent to the claimed results. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Analysis rests on the Polyak-Lojasiewicz condition as the key domain assumption enabling tractability; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Lower-level function satisfies the Polyak-Lojasiewicz (PL) condition
This assumption is invoked to obtain the stated complexity bounds in the nonconvex-nonconvex setting.

pith-pipeline@v0.9.0 · 5739 in / 1290 out tokens · 27502 ms · 2026-05-24T10:14:35.173821+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

ITD (ITerative Differentiation ) (Gould et al

to get an estimator of ∇ ϕ (x) to apply gradient descent. ITD (ITerative Differentiation ) (Gould et al. , 2016; Franceschi et al., 2017; Shaban et al., 2019; Bolte et al., 2021) approximates ∇ y∗(x) by ∂y K(x)/∂x , where yK(x) is K-steps of gradient descent. AID (Approximate Im- plicit Differentiation) (Domke, 2012; Ghadimi and Wang, 2018; Pedregosa, 201...

work page 2016
[2]

ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

− f (x2,y 2) ≤ L2 (∥x1 − x2∥ + ∥y2 − y′ 1∥) ≤ (L1 + 1)L2∥x1 − x2∥. ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

work page
[3]

This implies that ϕ (x) is Lipschitz on Bδ(x1)

− f (x1,y 1) ≤ L2 (∥x1 − x2∥ + ∥y1 − y′ 2∥) ≤ (L1 + 1)L2∥x1 − x2∥. This implies that ϕ (x) is Lipschitz on Bδ(x1). (d). For any compact set K ⊆ Rdx, there exists L > 0 such that ϕ (x) isL-Lipschitz on K. Then for any x1,x 2 ∈ K, takingf (x,y ) = − dist(y,Y ∗(x1)) yields d1 =ϕ (x1) − ϕ (x2) ≤ L∥x1 − x2∥, By symmetric, we can also show that d2 ≤ L∥x1 − x2∥....

work page
[4]

− f (x2,y 2) ≤ Cf ( ∥x1 − x2∥ + ∥y2 − y′ 1∥ ) ≤ (κ + 1)Cf ∥x1 − x2∥, ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

work page
[5]

− f (x1,y 1) ≤ Cf ( ∥x1 − x2∥ + ∥y1 − y′ 2∥ ) ≤ (κ + 1)Cf ∥x1 − x2∥, 23 CHEN XU ZHANG This establishes the Lipschitz continuity of ϕ (x). (f). Without loss of generality, we assume Cf = 1 , otherwise we can scale f (x,y ) byCf to prove the result. Because f (x,y ) is globally Lipschitz, we can take K = Rdx in the proof of d. Then by the same arguments, we...

work page 2018
[6]

+ q− 1∑ j=1 (v[j] − v[j+1])2   ≤ 1 4  v2

work page
[7]

+ q− 1∑ j=1 (v[j] − v[j+1])2 +v2 [q]   24 ON FINDING SMALL HYPER -G RADIENTS IN BILEVEL OPTIMIZATION ≤ 1 4  v2

work page
[8]

This proves property c

+ q− 1∑ j=1 2(v2 [j] +v2 [j+1]) +v2 [q]   ≤ q∑ j=1 v2 [j] = ∥v∥2. This proves property c. Finally property d holds sinceA is tridiagonal. In bilevel problems, it is crucial to ﬁnd a point y that is close to Y ∗(x), instead of just achieving a small optimality gap g(x,y ) − g∗(x). However, it is difﬁcult for any ﬁrst-order algorithms to “locate” the mini...

work page
[9]

f (x,y ) isc1-Lipschitz iny on X × Rdy

work page
[10]

f (x,y ) hasc2-Lipschitz gradients on X × Rdy

work page
[11]

g(y) is a strictly convex quadratic and has 1-Lipschitz gradients

work page
[12]

F or this problem, any algorithm with a procedure as Algorith m 1 stays at xt = 0 for any iteration numbert ≤ T

The resulting hyper-objective is ϕ (x) = (x + 1)2/ 2. F or this problem, any algorithm with a procedure as Algorith m 1 stays at xt = 0 for any iteration numbert ≤ T . Proof Letdx = 1,dy =q = 2TK ,β = 1/ √ q and f (x,y ) = 2(x + 1)2r(y), g (y) = β 2hq(y/β ). wherehq(y) follows Deﬁnition E.1 andr(y) = ∑q j=q/ 2+1ψ (y[j]), whereψ : R → R is ψ (y) =     ...

work page 2016
[13]

By Lemma G.1, it sufﬁces to set the parameters in the inner loop as Equation ( 8)

shows that one can ﬁnd an ǫ-stationary point of ϕ (x) withT = O(ǫ− 2) outer-loop iterations. By Lemma G.1, it sufﬁces to set the parameters in the inner loop as Equation ( 8). But Kt requires the knowledge of δt. Next, we bound δt via a recursion which allows us to get rid of the prior knowled ge of δt. By Lemma G.1 and Lemma 4.1, dist2 ( y0 t+1,Y ∗ σ (xt...

work page

[1] [1]

ITD (ITerative Differentiation ) (Gould et al

to get an estimator of ∇ ϕ (x) to apply gradient descent. ITD (ITerative Differentiation ) (Gould et al. , 2016; Franceschi et al., 2017; Shaban et al., 2019; Bolte et al., 2021) approximates ∇ y∗(x) by ∂y K(x)/∂x , where yK(x) is K-steps of gradient descent. AID (Approximate Im- plicit Differentiation) (Domke, 2012; Ghadimi and Wang, 2018; Pedregosa, 201...

work page 2016

[2] [2]

ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

− f (x2,y 2) ≤ L2 (∥x1 − x2∥ + ∥y2 − y′ 1∥) ≤ (L1 + 1)L2∥x1 − x2∥. ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

work page

[3] [3]

This implies that ϕ (x) is Lipschitz on Bδ(x1)

− f (x1,y 1) ≤ L2 (∥x1 − x2∥ + ∥y1 − y′ 2∥) ≤ (L1 + 1)L2∥x1 − x2∥. This implies that ϕ (x) is Lipschitz on Bδ(x1). (d). For any compact set K ⊆ Rdx, there exists L > 0 such that ϕ (x) isL-Lipschitz on K. Then for any x1,x 2 ∈ K, takingf (x,y ) = − dist(y,Y ∗(x1)) yields d1 =ϕ (x1) − ϕ (x2) ≤ L∥x1 − x2∥, By symmetric, we can also show that d2 ≤ L∥x1 − x2∥....

work page

[4] [4]

− f (x2,y 2) ≤ Cf ( ∥x1 − x2∥ + ∥y2 − y′ 1∥ ) ≤ (κ + 1)Cf ∥x1 − x2∥, ϕ (x2) − ϕ (x1) ≤ f (x2,y ′

work page

[5] [5]

− f (x1,y 1) ≤ Cf ( ∥x1 − x2∥ + ∥y1 − y′ 2∥ ) ≤ (κ + 1)Cf ∥x1 − x2∥, 23 CHEN XU ZHANG This establishes the Lipschitz continuity of ϕ (x). (f). Without loss of generality, we assume Cf = 1 , otherwise we can scale f (x,y ) byCf to prove the result. Because f (x,y ) is globally Lipschitz, we can take K = Rdx in the proof of d. Then by the same arguments, we...

work page 2018

[6] [6]

+ q− 1∑ j=1 (v[j] − v[j+1])2   ≤ 1 4  v2

work page

[7] [7]

+ q− 1∑ j=1 (v[j] − v[j+1])2 +v2 [q]   24 ON FINDING SMALL HYPER -G RADIENTS IN BILEVEL OPTIMIZATION ≤ 1 4  v2

work page

[8] [8]

This proves property c

+ q− 1∑ j=1 2(v2 [j] +v2 [j+1]) +v2 [q]   ≤ q∑ j=1 v2 [j] = ∥v∥2. This proves property c. Finally property d holds sinceA is tridiagonal. In bilevel problems, it is crucial to ﬁnd a point y that is close to Y ∗(x), instead of just achieving a small optimality gap g(x,y ) − g∗(x). However, it is difﬁcult for any ﬁrst-order algorithms to “locate” the mini...

work page

[9] [9]

f (x,y ) isc1-Lipschitz iny on X × Rdy

work page

[10] [10]

f (x,y ) hasc2-Lipschitz gradients on X × Rdy

work page

[11] [11]

g(y) is a strictly convex quadratic and has 1-Lipschitz gradients

work page

[12] [12]

F or this problem, any algorithm with a procedure as Algorith m 1 stays at xt = 0 for any iteration numbert ≤ T

The resulting hyper-objective is ϕ (x) = (x + 1)2/ 2. F or this problem, any algorithm with a procedure as Algorith m 1 stays at xt = 0 for any iteration numbert ≤ T . Proof Letdx = 1,dy =q = 2TK ,β = 1/ √ q and f (x,y ) = 2(x + 1)2r(y), g (y) = β 2hq(y/β ). wherehq(y) follows Deﬁnition E.1 andr(y) = ∑q j=q/ 2+1ψ (y[j]), whereψ : R → R is ψ (y) =     ...

work page 2016

[13] [13]

By Lemma G.1, it sufﬁces to set the parameters in the inner loop as Equation ( 8)

shows that one can ﬁnd an ǫ-stationary point of ϕ (x) withT = O(ǫ− 2) outer-loop iterations. By Lemma G.1, it sufﬁces to set the parameters in the inner loop as Equation ( 8). But Kt requires the knowledge of δt. Next, we bound δt via a recursion which allows us to get rid of the prior knowled ge of δt. By Lemma G.1 and Lemma 4.1, dist2 ( y0 t+1,Y ∗ σ (xt...

work page