Fully First-Order Algorithms for Online Bilevel Optimization
Pith reviewed 2026-05-16 02:21 UTC · model grok-4.3
The pith
A fully first-order algorithm solves nonconvex-strongly convex online bilevel optimization without Hessian-vector products.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting the original bilevel problem into an equivalent single-level online problem with inequality constraints and constructing Lagrangian functions, the paper obtains a fully first-order algorithm that achieves regret O(1 + V_T + H_{2,T}) with O(T log T) total iterations while avoiding all Hessian-vector product oracles.
What carries the argument
Reformulation of the bilevel problem as a single-level online problem with inequality constraints together with a sequence of Lagrangian functions, which enables first-order updates without implicit differentiation.
If this is right
- The algorithm applies directly to online bilevel tasks where Hessian-vector products are unavailable or prohibitively expensive.
- A single-loop variant yields sublinear regret once gradient-variation terms are included in the analysis.
- An adaptive inner-iteration scheme produces regret O(log T + V_T) independent of inner-solution drift.
- Under stochastic gradients the same first-order framework guarantees regret O(T^{2/3}(1 + σ²) + V_T + H_{2,T}).
- Numerical experiments confirm that the method reaches the stated regret rates on standard bilevel benchmarks.
Where Pith is reading between the lines
- The constraint-based reformulation may allow existing first-order constrained optimizers to be applied to a broader class of bilevel problems.
- The removal of H_{2,T} dependence in the adaptive variant could enable fully online tuning without prior knowledge of solution drift.
- These regret expressions suggest concrete iteration budgets for real-time bilevel learning applications where drift is measurable.
- Similar Lagrangian reformulations might transfer to other online hierarchical optimization settings that currently rely on second-order oracles.
Load-bearing premise
The reformulation of the original bilevel problem into a single-level online problem with inequality constraints must preserve exact equivalence, and the inner-level problem must remain strongly convex at every online step.
What would settle it
Execute the algorithm on an online bilevel instance where the inner optimal solution exhibits large drift (high H_{2,T}) and verify whether cumulative regret stays within O(1 + V_T + H_{2,T}).
read the original abstract
In this work, we study nonconvex-strongly convex online bilevel optimization (OBO) using only first-order oracle. Existing OBO algorithms are mainly based on hypergradient descent, which requires access to a Hessian-vector product (HVP) oracle and potentially incurs high computational costs. By reformulating the original OBO problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian function, we eliminate the need for HVPs arising from implicit differentiation. Specifically, we propose a fully first-order algorithm for OBO, and provide theoretical guarantees showing that it achieves regret of $O(1 + V_T + H_{2,T})$ with a total of $O(T\log T)$ iterations, where $V_T$ measures the variation in function values and $H_{2,T}$ characterizes the drift variation of the inner-level optimal solution. We also establish a sublinear regret bound under the single-loop structure by introducing additional gradient-variation terms. Furthermore, we develop an improved variant with an adaptive inner-iteration scheme, which removes the dependence on $H_{2,T}$ and achieves regret of $O(\log T + V_T)$. Finally, under the stochastic OBO setting, we establish the regret bound for the fully first-order algorithm, i.e., $O(T^{2/3}(1 + \sigma^2) + V_T + H_{2,T})$. Numerical experiments demonstrate the feasibility of our algorithm and support our theoretical findings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes fully first-order algorithms for nonconvex-strongly convex online bilevel optimization (OBO) by reformulating the bilevel problem as a single-level online program with inequality constraints derived from the inner optimality condition, then constructing a sequence of Lagrangian functions to eliminate Hessian-vector products. It claims a regret bound of O(1 + V_T + H_{2,T}) using O(T log T) iterations for the base algorithm, a sublinear bound under single-loop structure, an improved adaptive-inner-iteration variant achieving O(log T + V_T), and a stochastic regret of O(T^{2/3}(1 + σ²) + V_T + H_{2,T}), supported by numerical experiments.
Significance. If the reformulation equivalence holds exactly and the regret bounds are rigorously established, the work would be significant for providing practical first-order methods for OBO that avoid costly HVPs, with regret scaling only in natural variation measures V_T and H_{2,T}. The adaptive variant removing H_{2,T} dependence and the stochastic extension are particularly useful strengths; the numerical experiments add empirical support.
major comments (2)
- [Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.
- [Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.
minor comments (2)
- [Introduction] The precise definitions of the variation measures V_T and H_{2,T} should be stated explicitly in the introduction or notation section rather than only in the abstract, to improve readability.
- [Experiments] In the numerical experiments, the description of baselines and hyperparameter choices could be expanded for reproducibility, and error bars or multiple runs should be reported to strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, providing clarifications on the reformulation and theoretical assumptions while committing to revisions that strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.
Authors: We thank the referee for highlighting this subtlety. In the reformulation, the inequality constraints are derived directly from the KKT optimality conditions of the inner problem evaluated at the current outer variable x_t at each time step t; they are therefore time-varying and updated dynamically with the outer iterates rather than held static within an inner loop. The sequence of Lagrangian functions is constructed to incorporate these updates, and the analysis explicitly bounds any accumulated constraint violation or duality gap by the inner-solution drift measure H_{2,T}. This ensures equivalence holds in the online sense up to the variation terms already present in the regret bound. We will revise the reformulation section to include an explicit paragraph clarifying the dynamic constraint construction and its relation to H_{2,T}. revision: partial
-
Referee: [Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.
Authors: The inner problem is assumed strongly convex in y with a uniform parameter μ > 0 for any fixed outer variable x, which is the standard setting for nonconvex-strongly-convex bilevel problems. The outer updates induce changes in the inner objective, but these are controlled through the variation measures V_T and H_{2,T} that appear in the regret; the analysis does not require μ itself to vary with time. We agree that an explicit remark on error propagation under the uniform-μ assumption would improve clarity. We will add a short discussion paragraph in the theoretical guarantees section that recalls the uniform strong-convexity assumption and sketches how the existing variation bounds already propagate inner-solver errors without needing a drifting-μ analysis. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The central regret bounds O(1 + V_T + H_{2,T}) and variants are defined directly in terms of externally specified variation measures V_T (function-value variation) and H_{2,T} (inner-optima drift), which are not constructed from fitted parameters or self-referential definitions inside the algorithm. The reformulation of OBO as a single-level online constrained problem is stated as an equivalence via Lagrangian sequence construction; no equation reduces the claimed regret to a tautological renaming of the input data or to a self-citation chain. The analysis proceeds from standard first-order oracle assumptions and online convex optimization primitives without load-bearing self-citations or ansatz smuggling. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Inner-level problem is strongly convex for all outer parameters
- domain assumption Reformulation to single-level constrained problem is equivalent
Reference graph
Works this paper leans on
-
[1]
Non-convex bilevel games with critical point selection maps, 2022
Michael Arbel and Julien Mairal. Non-convex bilevel games with critical point selection maps, 2022
work page 2022
-
[2]
Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026
Jason Bohne, Ieva Petrulionyte, Michael Arbel, Julien Mairal, and Paweł Polak. Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026
-
[3]
Online nonconvex bilevel optimization with bregman divergences, 2024
Jason Bohne, David Rosenberg, Gary Kazantsev, and Pawel Polak. Online nonconvex bilevel optimization with bregman divergences, 2024
work page 2024
-
[4]
Mathematical programs with optimization problems in the constraints
Jerome Bracken and James T McGill. Mathematical programs with optimization problems in the constraints. Operations research, 21(1):37–44, 1973
work page 1973
-
[5]
Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 654–663. PMLR, 2017
work page 2017
-
[6]
Duchi, Oliver Hinder, and Aaron Sidford
Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020
work page 2020
-
[7]
Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023
-
[8]
Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025
work page 2025
-
[9]
Lesi Chen, Jing Xu, and Jingzhao Zhang. On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025
work page 2025
-
[10]
Forward and reverse gradient-based hyperparameter optimization, 2017
Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017
work page 2017
-
[11]
Approximation methods for bilevel programming, 2018
Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming, 2018
work page 2018
-
[12]
Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016
work page 2016
-
[13]
On the iteration complexity of hypergradient computation, 2020
Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation, 2020
work page 2020
-
[14]
Efficient regret minimization in non-convex games, 2017
Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games, 2017. 8
work page 2017
-
[15]
On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023
Feihu Huang. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023
work page 2023
-
[16]
Yu Huang, Yuan Cheng, Yingbin Liang, and Longbo Huang. Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023
work page 2023
-
[17]
Will bilevel optimizers benefit from loops, 2022
Kaiyi Ji, Mingrui Liu, Yingbin Liang, and Lei Ying. Will bilevel optimizers benefit from loops, 2022
work page 2022
-
[18]
Bilevel optimization: Convergence analysis and enhanced design, 2021
Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design, 2021
work page 2021
-
[19]
Achieving better local regret bound for online non-convex bilevel optimization, 2026
Tingkai Jia, Haiguang Wang, and Cheng Chen. Achieving better local regret bound for online non-convex bilevel optimization, 2026
work page 2026
-
[20]
A fully first-order method for stochastic bilevel optimization, 2023
Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowak. A fully first-order method for stochastic bilevel optimization, 2023
work page 2023
-
[21]
Non-convex bilevel optimization with time- varying objective functions, 2023
Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time- varying objective functions, 2023
work page 2023
-
[22]
Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021
Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021
work page 2021
-
[23]
A first-order augmented lagrangian method for constrained minimax optimization, 2024
Zhaosong Lu and Sanyou Mei. A first-order augmented lagrangian method for constrained minimax optimization, 2024
work page 2024
-
[24]
First-order penalty methods for bilevel optimization, 2024
Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization, 2024
work page 2024
-
[25]
Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019
work page 2019
-
[26]
Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015
work page 2015
-
[27]
Stochastic regret guarantees for online zeroth- and first-order bilevel optimization
Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, and George Michailidis. Stochastic regret guarantees for online zeroth- and first-order bilevel optimization. OpenReview, 2025. Available at https: //openreview.net/forum?id=nYrj9ZwgYk
work page 2025
-
[28]
On first-order meta-learning algorithms, 2018
Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018
work page 2018
-
[29]
Hyperparameter optimization with approximate gradient
Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR
work page 2016
-
[30]
Hyperparameter optimization with approximate gradient, 2022
Fabian Pedregosa. Hyperparameter optimization with approximate gradient, 2022
work page 2022
-
[31]
Truncated back-propagation for bilevel optimization, 2019
Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization, 2019
work page 2019
-
[32]
On penalty-based bilevel gradient descent method, 2025
Han Shen, Quan Xiao, and Tianyi Chen. On penalty-based bilevel gradient descent method, 2025
work page 2025
-
[33]
Es-maml: Simple hessian-free meta learning, 2020
Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. Es-maml: Simple hessian-free meta learning, 2020
work page 2020
-
[34]
A primal-dual approach to bilevel optimization with multiple inner minima, 2022
Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A primal-dual approach to bilevel optimization with multiple inner minima, 2022
work page 2022
-
[35]
On the convergence theory for hessian-free bilevel algorithms, 2022
Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms, 2022
work page 2022
-
[36]
Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024
Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, and Laura Balzano. Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024
work page 2024
-
[37]
Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation, 2019
work page 2019
-
[38]
Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025
Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025
work page 2025
-
[39]
Bome! bilevel optimization made easy: A simple first-order approach, 2022
Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach, 2022. 9 A Preliminary Lemmas We list all the lemmas used in the proof below. Lemma A.1(Lemma 16 in [36]).Under Assumptions 2.1, 2.2, for allt∈[T], anyx 1,x 2 ∈ X, we have ∥y∗ t (x1)−y ∗ t (x2)∥ ≤ Lg,1 µg ∥x1 −x 2∥. Lemma A.2(Lem...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.