pith. sign in

arxiv: 2602.11665 · v2 · submitted 2026-02-12 · 💻 cs.LG · math.OC

Fully First-Order Algorithms for Online Bilevel Optimization

Pith reviewed 2026-05-16 02:21 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords online bilevel optimizationfirst-order methodsregret boundsnonconvex-strongly convexLagrangian methodshypergradient avoidancestochastic bilevel optimization
0
0 comments X

The pith

A fully first-order algorithm solves nonconvex-strongly convex online bilevel optimization without Hessian-vector products.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops algorithms for online bilevel optimization where the outer problem is nonconvex and the inner problem is strongly convex, using only first-order information. It reformulates the bilevel problem as a single-level online optimization task with inequality constraints and builds a sequence of Lagrangian functions to bypass hypergradients and Hessian-vector products. The resulting method delivers regret O(1 + V_T + H_{2,T}) across O(T log T) iterations, where V_T captures changes in function values and H_{2,T} captures drift in the inner optimal solution. Variants include a single-loop version with sublinear regret and an adaptive scheme that removes dependence on H_{2,T} to reach O(log T + V_T). The same first-order approach also yields a regret bound O(T^{2/3}(1 + σ²) + V_T + H_{2,T}) under stochastic noise.

Core claim

By converting the original bilevel problem into an equivalent single-level online problem with inequality constraints and constructing Lagrangian functions, the paper obtains a fully first-order algorithm that achieves regret O(1 + V_T + H_{2,T}) with O(T log T) total iterations while avoiding all Hessian-vector product oracles.

What carries the argument

Reformulation of the bilevel problem as a single-level online problem with inequality constraints together with a sequence of Lagrangian functions, which enables first-order updates without implicit differentiation.

If this is right

  • The algorithm applies directly to online bilevel tasks where Hessian-vector products are unavailable or prohibitively expensive.
  • A single-loop variant yields sublinear regret once gradient-variation terms are included in the analysis.
  • An adaptive inner-iteration scheme produces regret O(log T + V_T) independent of inner-solution drift.
  • Under stochastic gradients the same first-order framework guarantees regret O(T^{2/3}(1 + σ²) + V_T + H_{2,T}).
  • Numerical experiments confirm that the method reaches the stated regret rates on standard bilevel benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The constraint-based reformulation may allow existing first-order constrained optimizers to be applied to a broader class of bilevel problems.
  • The removal of H_{2,T} dependence in the adaptive variant could enable fully online tuning without prior knowledge of solution drift.
  • These regret expressions suggest concrete iteration budgets for real-time bilevel learning applications where drift is measurable.
  • Similar Lagrangian reformulations might transfer to other online hierarchical optimization settings that currently rely on second-order oracles.

Load-bearing premise

The reformulation of the original bilevel problem into a single-level online problem with inequality constraints must preserve exact equivalence, and the inner-level problem must remain strongly convex at every online step.

What would settle it

Execute the algorithm on an online bilevel instance where the inner optimal solution exhibits large drift (high H_{2,T}) and verify whether cumulative regret stays within O(1 + V_T + H_{2,T}).

read the original abstract

In this work, we study nonconvex-strongly convex online bilevel optimization (OBO) using only first-order oracle. Existing OBO algorithms are mainly based on hypergradient descent, which requires access to a Hessian-vector product (HVP) oracle and potentially incurs high computational costs. By reformulating the original OBO problem as a single-level online problem with inequality constraints and constructing a sequence of Lagrangian function, we eliminate the need for HVPs arising from implicit differentiation. Specifically, we propose a fully first-order algorithm for OBO, and provide theoretical guarantees showing that it achieves regret of $O(1 + V_T + H_{2,T})$ with a total of $O(T\log T)$ iterations, where $V_T$ measures the variation in function values and $H_{2,T}$ characterizes the drift variation of the inner-level optimal solution. We also establish a sublinear regret bound under the single-loop structure by introducing additional gradient-variation terms. Furthermore, we develop an improved variant with an adaptive inner-iteration scheme, which removes the dependence on $H_{2,T}$ and achieves regret of $O(\log T + V_T)$. Finally, under the stochastic OBO setting, we establish the regret bound for the fully first-order algorithm, i.e., $O(T^{2/3}(1 + \sigma^2) + V_T + H_{2,T})$. Numerical experiments demonstrate the feasibility of our algorithm and support our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes fully first-order algorithms for nonconvex-strongly convex online bilevel optimization (OBO) by reformulating the bilevel problem as a single-level online program with inequality constraints derived from the inner optimality condition, then constructing a sequence of Lagrangian functions to eliminate Hessian-vector products. It claims a regret bound of O(1 + V_T + H_{2,T}) using O(T log T) iterations for the base algorithm, a sublinear bound under single-loop structure, an improved adaptive-inner-iteration variant achieving O(log T + V_T), and a stochastic regret of O(T^{2/3}(1 + σ²) + V_T + H_{2,T}), supported by numerical experiments.

Significance. If the reformulation equivalence holds exactly and the regret bounds are rigorously established, the work would be significant for providing practical first-order methods for OBO that avoid costly HVPs, with regret scaling only in natural variation measures V_T and H_{2,T}. The adaptive variant removing H_{2,T} dependence and the stochastic extension are particularly useful strengths; the numerical experiments add empirical support.

major comments (2)
  1. [Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.
  2. [Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.
minor comments (2)
  1. [Introduction] The precise definitions of the variation measures V_T and H_{2,T} should be stated explicitly in the introduction or notation section rather than only in the abstract, to improve readability.
  2. [Experiments] In the numerical experiments, the description of baselines and hyperparameter choices could be expanded for reproducibility, and error bars or multiple runs should be reported to strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, providing clarifications on the reformulation and theoretical assumptions while committing to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses
  1. Referee: [Reformulation and main algorithm] Reformulation section (as described in the abstract): the central claim that the inequality-constrained single-level problem is exactly equivalent to the original OBO at every round relies on the constraints capturing the drifting inner optimum without gap; however, when the inner solution varies (H_{2,T} term), treating the constraint set as static within each inner loop risks introducing a growing duality gap or violation that would invalidate the subsequent first-order Lagrangian updates and the O(1 + V_T + H_{2,T}) regret.

    Authors: We thank the referee for highlighting this subtlety. In the reformulation, the inequality constraints are derived directly from the KKT optimality conditions of the inner problem evaluated at the current outer variable x_t at each time step t; they are therefore time-varying and updated dynamically with the outer iterates rather than held static within an inner loop. The sequence of Lagrangian functions is constructed to incorporate these updates, and the analysis explicitly bounds any accumulated constraint violation or duality gap by the inner-solution drift measure H_{2,T}. This ensures equivalence holds in the online sense up to the variation terms already present in the regret bound. We will revise the reformulation section to include an explicit paragraph clarifying the dynamic constraint construction and its relation to H_{2,T}. revision: partial

  2. Referee: [Theoretical analysis] Theoretical guarantees section: the regret bound O(1 + V_T + H_{2,T}) and the assumption that the inner-level problem remains strongly convex throughout the online process are load-bearing, yet no explicit verification, robustness margin, or error propagation analysis is provided for cases where the strong-convexity parameter drifts with the outer updates.

    Authors: The inner problem is assumed strongly convex in y with a uniform parameter μ > 0 for any fixed outer variable x, which is the standard setting for nonconvex-strongly-convex bilevel problems. The outer updates induce changes in the inner objective, but these are controlled through the variation measures V_T and H_{2,T} that appear in the regret; the analysis does not require μ itself to vary with time. We agree that an explicit remark on error propagation under the uniform-μ assumption would improve clarity. We will add a short discussion paragraph in the theoretical guarantees section that recalls the uniform strong-convexity assumption and sketches how the existing variation bounds already propagate inner-solver errors without needing a drifting-μ analysis. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The central regret bounds O(1 + V_T + H_{2,T}) and variants are defined directly in terms of externally specified variation measures V_T (function-value variation) and H_{2,T} (inner-optima drift), which are not constructed from fitted parameters or self-referential definitions inside the algorithm. The reformulation of OBO as a single-level online constrained problem is stated as an equivalence via Lagrangian sequence construction; no equation reduces the claimed regret to a tautological renaming of the input data or to a self-citation chain. The analysis proceeds from standard first-order oracle assumptions and online convex optimization primitives without load-bearing self-citations or ansatz smuggling. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions for bilevel optimization and online regret analysis; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Inner-level problem is strongly convex for all outer parameters
    Required to ensure unique inner solutions and to control drift term H_{2,T}.
  • domain assumption Reformulation to single-level constrained problem is equivalent
    Invoked to justify applying first-order methods without implicit differentiation.

pith-pipeline@v0.9.0 · 5558 in / 1249 out tokens · 33078 ms · 2026-05-16T02:21:14.231389+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

  1. [1]

    Non-convex bilevel games with critical point selection maps, 2022

    Michael Arbel and Julien Mairal. Non-convex bilevel games with critical point selection maps, 2022

  2. [2]

    Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

    Jason Bohne, Ieva Petrulionyte, Michael Arbel, Julien Mairal, and Paweł Polak. Non-stationary functional bilevel optimization.arXiv preprint arXiv:2601.15363, 2026

  3. [3]

    Online nonconvex bilevel optimization with bregman divergences, 2024

    Jason Bohne, David Rosenberg, Gary Kazantsev, and Pawel Polak. Online nonconvex bilevel optimization with bregman divergences, 2024

  4. [4]

    Mathematical programs with optimization problems in the constraints

    Jerome Bracken and James T McGill. Mathematical programs with optimization problems in the constraints. Operations research, 21(1):37–44, 1973

  5. [5]

    convex until proven guilty

    Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. “convex until proven guilty”: Dimension-free acceleration of gradient descent on non-convex functions. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 654–663. PMLR, 2017

  6. [6]

    Duchi, Oliver Hinder, and Aaron Sidford

    Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020

  7. [7]

    Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

    Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.arXiv preprint arXiv:2306.14853, 2023

  8. [8]

    Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

    Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal nonconvex-strongly-convex bilevel optimization with fully first-order oracles.Journal of Machine Learning Research, 26(109):1–56, 2025

  9. [9]

    On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

    Lesi Chen, Jing Xu, and Jingzhao Zhang. On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis, 2025

  10. [10]

    Forward and reverse gradient-based hyperparameter optimization, 2017

    Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization, 2017

  11. [11]

    Approximation methods for bilevel programming, 2018

    Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming, 2018

  12. [12]

    On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

    Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization, 2016

  13. [13]

    On the iteration complexity of hypergradient computation, 2020

    Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation, 2020

  14. [14]

    Efficient regret minimization in non-convex games, 2017

    Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games, 2017. 8

  15. [15]

    On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

    Feihu Huang. On momentum-based gradient methods for bilevel optimization with nonconvex lower-level, 2023

  16. [16]

    Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

    Yu Huang, Yuan Cheng, Yingbin Liang, and Longbo Huang. Online min-max problems with non-convexity and non-stationarity.Transactions on Machine Learning Research, 2023

  17. [17]

    Will bilevel optimizers benefit from loops, 2022

    Kaiyi Ji, Mingrui Liu, Yingbin Liang, and Lei Ying. Will bilevel optimizers benefit from loops, 2022

  18. [18]

    Bilevel optimization: Convergence analysis and enhanced design, 2021

    Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design, 2021

  19. [19]

    Achieving better local regret bound for online non-convex bilevel optimization, 2026

    Tingkai Jia, Haiguang Wang, and Cheng Chen. Achieving better local regret bound for online non-convex bilevel optimization, 2026

  20. [20]

    A fully first-order method for stochastic bilevel optimization, 2023

    Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowak. A fully first-order method for stochastic bilevel optimization, 2023

  21. [21]

    Non-convex bilevel optimization with time- varying objective functions, 2023

    Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time- varying objective functions, 2023

  22. [22]

    Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

    Risheng Liu, Yaohua Liu, Shangzhi Zeng, and Jin Zhang. Towards gradient-based bilevel optimization with non-convex followers and beyond, 2021

  23. [23]

    A first-order augmented lagrangian method for constrained minimax optimization, 2024

    Zhaosong Lu and Sanyou Mei. A first-order augmented lagrangian method for constrained minimax optimization, 2024

  24. [24]

    First-order penalty methods for bilevel optimization, 2024

    Zhaosong Lu and Sanyou Mei. First-order penalty methods for bilevel optimization, 2024

  25. [25]

    Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

    Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions, 2019

  26. [26]

    Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based hyperparameter optimization through reversible learning, 2015

  27. [27]

    Stochastic regret guarantees for online zeroth- and first-order bilevel optimization

    Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, and George Michailidis. Stochastic regret guarantees for online zeroth- and first-order bilevel optimization. OpenReview, 2025. Available at https: //openreview.net/forum?id=nYrj9ZwgYk

  28. [28]

    On first-order meta-learning algorithms, 2018

    Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms, 2018

  29. [29]

    Hyperparameter optimization with approximate gradient

    Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In Maria Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746, New York, New York, USA, 20–22 Jun 2016. PMLR

  30. [30]

    Hyperparameter optimization with approximate gradient, 2022

    Fabian Pedregosa. Hyperparameter optimization with approximate gradient, 2022

  31. [31]

    Truncated back-propagation for bilevel optimization, 2019

    Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization, 2019

  32. [32]

    On penalty-based bilevel gradient descent method, 2025

    Han Shen, Quan Xiao, and Tianyi Chen. On penalty-based bilevel gradient descent method, 2025

  33. [33]

    Es-maml: Simple hessian-free meta learning, 2020

    Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. Es-maml: Simple hessian-free meta learning, 2020

  34. [34]

    A primal-dual approach to bilevel optimization with multiple inner minima, 2022

    Daouda Sow, Kaiyi Ji, Ziwei Guan, and Yingbin Liang. A primal-dual approach to bilevel optimization with multiple inner minima, 2022

  35. [35]

    On the convergence theory for hessian-free bilevel algorithms, 2022

    Daouda Sow, Kaiyi Ji, and Yingbin Liang. On the convergence theory for hessian-free bilevel algorithms, 2022

  36. [36]

    Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

    Davoud Ataee Tarzanagh, Parvin Nazari, Bojian Hou, Li Shen, and Laura Balzano. Online bilevel optimization: Regret analysis of online alternating gradient methods, 2024

  37. [37]

    Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. Multimodal model-agnostic meta-learning via task-aware modulation, 2019

  38. [38]

    Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

    Yifan Yang, Peiyao Xiao, and Kaiyi Ji. Achieving O(ϵ−1.5) complexity in hessian/jacobian-free stochastic bilevel optimization, 2025

  39. [39]

    Bome! bilevel optimization made easy: A simple first-order approach, 2022

    Mao Ye, Bo Liu, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach, 2022. 9 A Preliminary Lemmas We list all the lemmas used in the proof below. Lemma A.1(Lemma 16 in [36]).Under Assumptions 2.1, 2.2, for allt∈[T], anyx 1,x 2 ∈ X, we have ∥y∗ t (x1)−y ∗ t (x2)∥ ≤ Lg,1 µg ∥x1 −x 2∥. Lemma A.2(Lem...