Fully First-Order Algorithms for Online Bilevel Optimization

Cheng Chen; Tingkai Jia

arxiv: 2602.11665 · v2 · submitted 2026-02-12 · 💻 cs.LG · math.OC

Fully First-Order Algorithms for Online Bilevel Optimization

Tingkai Jia , Cheng Chen This is my paper

Pith reviewed 2026-05-16 02:21 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords online bilevel optimizationfirst-order methodsregret boundsnonconvex-strongly convexLagrangian methodshypergradient avoidancestochastic bilevel optimization

0 comments

The pith

A fully first-order algorithm solves nonconvex-strongly convex online bilevel optimization without Hessian-vector products.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops algorithms for online bilevel optimization where the outer problem is nonconvex and the inner problem is strongly convex, using only first-order information. It reformulates the bilevel problem as a single-level online optimization task with inequality constraints and builds a sequence of Lagrangian functions to bypass hypergradients and Hessian-vector products. The resulting method delivers regret O(1 + V_T + H_{2,T}) across O(T log T) iterations, where V_T captures changes in function values and H_{2,T} captures drift in the inner optimal solution. Variants include a single-loop version with sublinear regret and an adaptive scheme that removes dependence on H_{2,T} to reach O(log T + V_T). The same first-order approach also yields a regret bound O(T^{2/3}(1 + σ²) + V_T + H_{2,T}) under stochastic noise.

Core claim

By converting the original bilevel problem into an equivalent single-level online problem with inequality constraints and constructing Lagrangian functions, the paper obtains a fully first-order algorithm that achieves regret O(1 + V_T + H_{2,T}) with O(T log T) total iterations while avoiding all Hessian-vector product oracles.

What carries the argument

Reformulation of the bilevel problem as a single-level online problem with inequality constraints together with a sequence of Lagrangian functions, which enables first-order updates without implicit differentiation.

If this is right

The algorithm applies directly to online bilevel tasks where Hessian-vector products are unavailable or prohibitively expensive.
A single-loop variant yields sublinear regret once gradient-variation terms are included in the analysis.
An adaptive inner-iteration scheme produces regret O(log T + V_T) independent of inner-solution drift.
Under stochastic gradients the same first-order framework guarantees regret O(T^{2/3}(1 + σ²) + V_T + H_{2,T}).
Numerical experiments confirm that the method reaches the stated regret rates on standard bilevel benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The constraint-based reformulation may allow existing first-order constrained optimizers to be applied to a broader class of bilevel problems.
The removal of H_{2,T} dependence in the adaptive variant could enable fully online tuning without prior knowledge of solution drift.
These regret expressions suggest concrete iteration budgets for real-time bilevel learning applications where drift is measurable.
Similar Lagrangian reformulations might transfer to other online hierarchical optimization settings that currently rely on second-order oracles.

Load-bearing premise

The reformulation of the original bilevel problem into a single-level online problem with inequality constraints must preserve exact equivalence, and the inner-level problem must remain strongly convex at every online step.

What would settle it

Execute the algorithm on an online bilevel instance where the inner optimal solution exhibits large drift (high H_{2,T}) and verify whether cumulative regret stays within O(1 + V_T + H_{2,T}).

read the original abstract

In this work, we study nonconvex-strongly convex online bilevel optimization (OBO) using only first-order oracle. Existing OBO algorithms are mainly based on hypergradient descent, which requires access to a Hessian-vector product (HVP) oracle and potentially incurs high computational costs. By reformulating the original OBO problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian function, we eliminate the need for HVPs arising from implicit differentiation. Specifically, we propose a fully first-order algorithm for OBO, and provide theoretical guarantees showing that it achieves regret of $O(1 + V_T + H_{2,T})$ with a total of $O(T\log T)$ iterations, where $V_T$ measures the variation in function values and $H_{2,T}$ characterizes the drift variation of the inner-level optimal solution. We also establish a sublinear regret bound under the single-loop structure by introducing additional gradient-variation terms. Furthermore, we develop an improved variant with an adaptive inner-iteration scheme, which removes the dependence on $H_{2,T}$ and achieves regret of $O(\log T + V_T)$. Finally, under the stochastic OBO setting, we establish the regret bound for the fully first-order algorithm, i.e., $O(T^{2/3}(1 + \sigma^2) + V_T + H_{2,T})$. Numerical experiments demonstrate the feasibility of our algorithm and support our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper removes the HVP bottleneck in online bilevel optimization with a Lagrangian reformulation of the bilevel problem into a constrained single-level one, delivering regret bounds that track variation terms.

read the letter

The core advance is a fully first-order method for nonconvex-strongly convex online bilevel optimization. By rewriting the problem as a single-level online program with inequality constraints drawn from the inner optimality condition and then working with a sequence of Lagrangians, the approach avoids any Hessian-vector products. They obtain a regret of O(1 + V_T + H_{2,T}) using O(T log T) total iterations, plus an adaptive inner-iteration variant that removes the H_{2,T} dependence and reaches O(log T + V_T). A stochastic version is also given with the expected O(T^{2/3}(1 + sigma^2) + V_T + H_{2,T}) bound. Experiments are reported to back the claims up numerically. This directly tackles the scalability issue that hypergradient methods face when models get larger, and the dependence on natural variation measures rather than artificial parameters is a reasonable modeling choice for the online setting. The reformulation step is the part that needs the most scrutiny. The stress-test concern is real: when the inner optimum drifts, the inequality constraints derived from stationarity must continue to capture the exact bilevel trajectory, or else the Lagrangian updates can accumulate a gap that the regret analysis does not fully absorb. If the full proofs show that the duality gap stays controlled under the stated assumptions on strong convexity and the variation terms, the bounds hold; otherwise the central guarantee weakens. The abstract itself gives no proof sketches, so the paper stands or falls on the detailed argument in the body. This work is aimed at researchers already working on online bilevel or hypergradient methods who need cheaper per-iteration costs. It is worth sending to peer review because it attacks a concrete computational barrier with a new algorithmic route and supplies matching theory, even if the reformulation equivalence will require careful referee attention.

Referee Report

2 major / 2 minor

Summary. The paper proposes fully first-order algorithms for nonconvex-strongly convex online bilevel optimization (OBO) by reformulating the bilevel problem as a single-level online program with inequality constraints derived from the inner optimality condition, then constructing a sequence of Lagrangian functions to eliminate Hessian-vector products. It claims a regret bound of O(1 + V_T + H_{2,T}) using O(T log T) iterations for the base algorithm, a sublinear bound under single-loop structure, an improved adaptive-inner-iteration variant achieving O(log T + V_T), and a stochastic regret of O(T^{2/3}(1 + σ²) + V_T + H_{2,T}), supported by numerical experiments.

Significance. If the reformulation equivalence holds exactly and the regret bounds are rigorously established, the work would be significant for providing practical first-order methods for OBO that avoid costly HVPs, with regret scaling only in natural variation measures V_T and H_{2,T}. The adaptive variant removing H_{2,T} dependence and the stochastic extension are particularly useful strengths; the numerical experiments add empirical support.

major comments (2)

[Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.
[Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.

minor comments (2)

[Introduction] The precise definitions of the variation measures V_T and H_{2,T} should be stated explicitly in the introduction or notation section rather than only in the abstract, to improve readability.
[Experiments] In the numerical experiments, the description of baselines and hyperparameter choices could be expanded for reproducibility, and error bars or multiple runs should be reported to strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, providing clarifications on the reformulation and theoretical assumptions while committing to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.

Authors: We thank the referee for highlighting this subtlety. In the reformulation, the inequality constraints are derived directly from the KKT optimality conditions of the inner problem evaluated at the current outer variable x_t at each time step t; they are therefore time-varying and updated dynamically with the outer iterates rather than held static within an inner loop. The sequence of Lagrangian functions is constructed to incorporate these updates, and the analysis explicitly bounds any accumulated constraint violation or duality gap by the inner-solution drift measure H_{2,T}. This ensures equivalence holds in the online sense up to the variation terms already present in the regret bound. We will revise the reformulation section to include an explicit paragraph clarifying the dynamic constraint construction and its relation to H_{2,T}. revision: partial
Referee: [Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.

Authors: The inner problem is assumed strongly convex in y with a uniform parameter μ > 0 for any fixed outer variable x, which is the standard setting for nonconvex-strongly-convex bilevel problems. The outer updates induce changes in the inner objective, but these are controlled through the variation measures V_T and H_{2,T} that appear in the regret; the analysis does not require μ itself to vary with time. We agree that an explicit remark on error propagation under the uniform-μ assumption would improve clarity. We will add a short discussion paragraph in the theoretical guarantees section that recalls the uniform strong-convexity assumption and sketches how the existing variation bounds already propagate inner-solver errors without needing a drifting-μ analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The central regret bounds O(1 + V_T + H_{2,T}) and variants are defined directly in terms of externally specified variation measures V_T (function-value variation) and H_{2,T} (inner-optima drift), which are not constructed from fitted parameters or self-referential definitions inside the algorithm. The reformulation of OBO as a single-level online constrained problem is stated as an equivalence via Lagrangian sequence construction; no equation reduces the claimed regret to a tautological renaming of the input data or to a self-citation chain. The analysis proceeds from standard first-order oracle assumptions and online convex optimization primitives without load-bearing self-citations or ansatz smuggling. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions for bilevel optimization and online regret analysis; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Inner-level problem is strongly convex for all outer parameters
Required to ensure unique inner solutions and to control drift term H_{2,T}.
domain assumption Reformulation to single-level constrained problem is equivalent
Invoked to justify applying first-order methods without implicit differentiation.

pith-pipeline@v0.9.0 · 5558 in / 1249 out tokens · 33078 ms · 2026-05-16T02:21:14.231389+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1]

Non-convex bilevel games with critical point selection maps, 2022

Michael Arbel and Julien Mairal. Non-convex bilevel games with critical point selection maps, 2022

work page 2022
[2]

Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

Jason Bohne, Ieva Petrulionyte, Michael Arbel, Julien Mairal, and Paweł Polak. Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

work page arXiv 2026
[3]

Online nonconvex bilevel optimization with bregman divergences, 2024

Jason Bohne, David Rosenberg, Gary Kazantsev, and Pawel Polak. Online nonconvex bilevel optimization with bregman divergences, 2024

work page 2024
[4]

Mathematical programs with optimization problems in the constraints

Jerome Bracken and James T McGill. Mathematical programs with optimization problems in the constraints. Operations research, 21(1):37–44, 1973

work page 1973
[5]

convex until proven guilty

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 654–663. PMLR, 2017

work page 2017
[6]

Duchi, Oliver Hinder, and Aaron Sidford

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020

work page 2020
[7]

Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

work page arXiv 2023
[8]

Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

work page 2025
[9]

On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

Lesi Chen, Jing Xu, and Jingzhao Zhang. On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

work page 2025
[10]

Forward and reverse gradient-based hyperparameter optimization, 2017

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017

work page 2017
[11]

Approximation methods for bilevel programming, 2018

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming, 2018

work page 2018
[12]

On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

work page 2016
[13]

On the iteration complexity of hypergradient computation, 2020

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation, 2020

work page 2020
[14]

Efficient regret minimization in non-convex games, 2017

Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games, 2017. 8

work page 2017
[15]

On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

Feihu Huang. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

work page 2023
[16]

Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

Yu Huang, Yuan Cheng, Yingbin Liang, and Longbo Huang. Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

work page 2023
[17]

Will bilevel optimizers benefit from loops, 2022

Kaiyi Ji, Mingrui Liu, Yingbin Liang, and Lei Ying. Will bilevel optimizers benefit from loops, 2022

work page 2022
[18]

Bilevel optimization: Convergence analysis and enhanced design, 2021

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design, 2021

work page 2021
[19]

Achieving better local regret bound for online non-convex bilevel optimization, 2026

Tingkai Jia, Haiguang Wang, and Cheng Chen. Achieving better local regret bound for online non-convex bilevel optimization, 2026

work page 2026
[20]

A fully first-order method for stochastic bilevel optimization, 2023

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowak. A fully first-order method for stochastic bilevel optimization, 2023

work page 2023
[21]

Non-convex bilevel optimization with time- varying objective functions, 2023

Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time- varying objective functions, 2023

work page 2023
[22]

Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

work page 2021
[23]

A first-order augmented lagrangian method for constrained minimax optimization, 2024

Zhaosong Lu and Sanyou Mei. A first-order augmented lagrangian method for constrained minimax optimization, 2024

work page 2024
[24]

First-order penalty methods for bilevel optimization, 2024

Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization, 2024

work page 2024
[25]

Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

work page 2019
[26]

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015

work page 2015
[27]

Stochastic regret guarantees for online zeroth- and first-order bilevel optimization

Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, and George Michailidis. Stochastic regret guarantees for online zeroth- and first-order bilevel optimization. OpenReview, 2025. Available at https: //openreview.net/forum?id=nYrj9ZwgYk

work page 2025
[28]

On first-order meta-learning algorithms, 2018

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018

work page 2018
[29]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR

work page 2016
[30]

Hyperparameter optimization with approximate gradient, 2022

Fabian Pedregosa. Hyperparameter optimization with approximate gradient, 2022

work page 2022
[31]

Truncated back-propagation for bilevel optimization, 2019

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization, 2019

work page 2019
[32]

On penalty-based bilevel gradient descent method, 2025

Han Shen, Quan Xiao, and Tianyi Chen. On penalty-based bilevel gradient descent method, 2025

work page 2025
[33]

Es-maml: Simple hessian-free meta learning, 2020

Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. Es-maml: Simple hessian-free meta learning, 2020

work page 2020
[34]

A primal-dual approach to bilevel optimization with multiple inner minima, 2022

Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A primal-dual approach to bilevel optimization with multiple inner minima, 2022

work page 2022
[35]

On the convergence theory for hessian-free bilevel algorithms, 2022

Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms, 2022

work page 2022
[36]

Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, and Laura Balzano. Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

work page 2024
[37]

Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation, 2019

work page 2019
[38]

Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

work page 2025
[39]

Bome! bilevel optimization made easy: A simple first-order approach, 2022

Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach, 2022. 9 A Preliminary Lemmas We list all the lemmas used in the proof below. Lemma A.1(Lemma 16 in [36]).Under Assumptions 2.1, 2.2, for allt∈[T], anyx 1,x 2 ∈ X, we have ∥y∗ t (x1)−y ∗ t (x2)∥ ≤ Lg,1 µg ∥x1 −x 2∥. Lemma A.2(Lem...

work page 2022

[1] [1]

Non-convex bilevel games with critical point selection maps, 2022

Michael Arbel and Julien Mairal. Non-convex bilevel games with critical point selection maps, 2022

work page 2022

[2] [2]

Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

Jason Bohne, Ieva Petrulionyte, Michael Arbel, Julien Mairal, and Paweł Polak. Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

work page arXiv 2026

[3] [3]

Online nonconvex bilevel optimization with bregman divergences, 2024

Jason Bohne, David Rosenberg, Gary Kazantsev, and Pawel Polak. Online nonconvex bilevel optimization with bregman divergences, 2024

work page 2024

[4] [4]

Mathematical programs with optimization problems in the constraints

Jerome Bracken and James T McGill. Mathematical programs with optimization problems in the constraints. Operations research, 21(1):37–44, 1973

work page 1973

[5] [5]

convex until proven guilty

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 654–663. PMLR, 2017

work page 2017

[6] [6]

Duchi, Oliver Hinder, and Aaron Sidford

Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020

work page 2020

[7] [7]

Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

work page arXiv 2023

[8] [8]

Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

work page 2025

[9] [9]

On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

Lesi Chen, Jing Xu, and Jingzhao Zhang. On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

work page 2025

[10] [10]

Forward and reverse gradient-based hyperparameter optimization, 2017

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017

work page 2017

[11] [11]

Approximation methods for bilevel programming, 2018

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming, 2018

work page 2018

[12] [12]

On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

work page 2016

[13] [13]

On the iteration complexity of hypergradient computation, 2020

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation, 2020

work page 2020

[14] [14]

Efficient regret minimization in non-convex games, 2017

Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games, 2017. 8

work page 2017

[15] [15]

On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

Feihu Huang. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

work page 2023

[16] [16]

Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

Yu Huang, Yuan Cheng, Yingbin Liang, and Longbo Huang. Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

work page 2023

[17] [17]

Will bilevel optimizers benefit from loops, 2022

Kaiyi Ji, Mingrui Liu, Yingbin Liang, and Lei Ying. Will bilevel optimizers benefit from loops, 2022

work page 2022

[18] [18]

Bilevel optimization: Convergence analysis and enhanced design, 2021

Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design, 2021

work page 2021

[19] [19]

Achieving better local regret bound for online non-convex bilevel optimization, 2026

Tingkai Jia, Haiguang Wang, and Cheng Chen. Achieving better local regret bound for online non-convex bilevel optimization, 2026

work page 2026

[20] [20]

A fully first-order method for stochastic bilevel optimization, 2023

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowak. A fully first-order method for stochastic bilevel optimization, 2023

work page 2023

[21] [21]

Non-convex bilevel optimization with time- varying objective functions, 2023

Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time- varying objective functions, 2023

work page 2023

[22] [22]

Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

work page 2021

[23] [23]

A first-order augmented lagrangian method for constrained minimax optimization, 2024

Zhaosong Lu and Sanyou Mei. A first-order augmented lagrangian method for constrained minimax optimization, 2024

work page 2024

[24] [24]

First-order penalty methods for bilevel optimization, 2024

Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization, 2024

work page 2024

[25] [25]

Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

work page 2019

[26] [26]

Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015

work page 2015

[27] [27]

Stochastic regret guarantees for online zeroth- and first-order bilevel optimization

Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, and George Michailidis. Stochastic regret guarantees for online zeroth- and first-order bilevel optimization. OpenReview, 2025. Available at https: //openreview.net/forum?id=nYrj9ZwgYk

work page 2025

[28] [28]

On first-order meta-learning algorithms, 2018

Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018

work page 2018

[29] [29]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR

work page 2016

[30] [30]

Hyperparameter optimization with approximate gradient, 2022

Fabian Pedregosa. Hyperparameter optimization with approximate gradient, 2022

work page 2022

[31] [31]

Truncated back-propagation for bilevel optimization, 2019

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization, 2019

work page 2019

[32] [32]

On penalty-based bilevel gradient descent method, 2025

Han Shen, Quan Xiao, and Tianyi Chen. On penalty-based bilevel gradient descent method, 2025

work page 2025

[33] [33]

Es-maml: Simple hessian-free meta learning, 2020

Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. Es-maml: Simple hessian-free meta learning, 2020

work page 2020

[34] [34]

A primal-dual approach to bilevel optimization with multiple inner minima, 2022

Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A primal-dual approach to bilevel optimization with multiple inner minima, 2022

work page 2022

[35] [35]

On the convergence theory for hessian-free bilevel algorithms, 2022

Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms, 2022

work page 2022

[36] [36]

Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, and Laura Balzano. Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

work page 2024

[37] [37]

Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation, 2019

work page 2019

[38] [38]

Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

work page 2025

[39] [39]

Bome! bilevel optimization made easy: A simple first-order approach, 2022

Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach, 2022. 9 A Preliminary Lemmas We list all the lemmas used in the proof below. Lemma A.1(Lemma 16 in [36]).Under Assumptions 2.1, 2.2, for allt∈[T], anyx 1,x 2 ∈ X, we have ∥y∗ t (x1)−y ∗ t (x2)∥ ≤ Lg,1 µg ∥x1 −x 2∥. Lemma A.2(Lem...

work page 2022