Continuous-Time Analysis for Minimax and Bilevel Problems

Dohyun Kwon; Hyunwoo Lee; Jeongyeol Kwon

arxiv: 2605.20898 · v1 · pith:IJOUTAHCnew · submitted 2026-05-20 · 🧮 math.OC

Continuous-Time Analysis for Minimax and Bilevel Problems

Hyunwoo Lee , Jeongyeol Kwon , Dohyun Kwon This is my paper

Pith reviewed 2026-05-21 03:42 UTC · model grok-4.3

classification 🧮 math.OC

keywords continuous-time analysisLyapunov templateminimax optimizationbilevel optimizationgradient flowtime-scale separationerror-bound conditionnested optimization

0 comments

The pith

A unified Lyapunov template analyzes continuous-time dynamics for minimax, bilevel, and min-min-max problems using an error-bound condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified Lyapunov template for continuous-time analysis of single-loop gradient-flow dynamics in nested optimization settings. This template covers minimax problems, bilevel optimization through a lifted penalty formulation, and min-min-max problems. It uses modular reusable lemmas to characterize time-scale separation across different convexity regimes via an error-bound condition, producing explicit closed-form thresholds. A sympathetic reader would care because this approach avoids the need for problem-specific Lyapunov functions and the coupled ratio conditions typical in discrete analyses, potentially simplifying theoretical and practical work on these problems.

Core claim

The authors propose the first unified Lyapunov template for continuous-time analysis that covers minimax, bilevel via a lifted penalty formulation, and min-min-max. The modular proof built from reusable lemmas yields a unified characterization of time-scale separation that bridges regimes from strong convexity/concavity to mere convexity through an error-bound condition, and produces explicit closed-form thresholds that avoid coupled ratio conditions common in discrete-time analyses. The paper further compares the penalty dynamics with the ideal hyper-gradient flow, derives a finite-time tracking bound, and discusses an Euler one-step analogue.

What carries the argument

The unified Lyapunov template constructed from reusable lemmas that characterize time-scale separation via an error-bound condition.

If this is right

The analysis applies uniformly to minimax, bilevel, and min-min-max problems.
Explicit closed-form thresholds for time-scale separation are obtained without coupled ratios.
The penalty dynamics admit a finite-time tracking bound relative to the ideal hyper-gradient flow.
Stable forward-Euler discretization preserves the predicted relative time-scale regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular lemmas might be adapted to analyze other nested problems like tri-level optimization.
Connecting the continuous-time results to discrete implementations could help design step-size rules that respect the derived thresholds.
This framework suggests that error-bound conditions can serve as a general tool for handling varying convexity strengths in optimization dynamics.

Load-bearing premise

The reusable lemmas and error-bound condition suffice to characterize time-scale separation across convexity regimes without requiring problem-specific Lyapunov constructions or coupled ratio conditions.

What would settle it

A numerical simulation of the gradient-flow dynamics on a bilevel problem where the error-bound condition is satisfied but the observed convergence fails to match the predicted explicit time-scale threshold from the unified template.

Figures

Figures reproduced from arXiv: 2605.20898 by Dohyun Kwon, Hyunwoo Lee, Jeongyeol Kwon.

**Figure 2.** Figure 2: Tracking error between the single-loop lifted-penalty trajectory [PITH_FULL_IMAGE:figures/full_fig_p035_2.png] view at source ↗

read the original abstract

We study single-loop gradient-flow dynamics for nested optimization, where the outer variable evolves while auxiliary variables track the inner solution map. While existing analyses typically rely on problem- and condition-specific Lyapunov constructions, we propose, to our knowledge, the first unified Lyapunov template for continuous-time analysis that covers minimax, bilevel via a lifted penalty formulation, and min--min--max. Our proof is modular, built from reusable lemmas that yield a unified characterization of time-scale separation. This characterization bridges regimes from strong convexity/concavity to mere convexity through an error-bound condition, and produces explicit closed-form thresholds that avoid the coupled ratio conditions common in discrete-time analyses. We further compare the penalty dynamics with the ideal hyper-gradient flow, derive a finite-time tracking bound, and discuss an Euler one-step analogue; hypercleaning diagnostics show that the predicted relative time-scale regions remain visible under stable forward-Euler discretization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a modular unified Lyapunov template for continuous-time nested optimization but the bilevel penalty may still introduce some scale coupling in convex cases.

read the letter

The main takeaway is that they have put together a single Lyapunov template that works for minimax, bilevel with a lifted penalty, and min-min-max in continuous time. The proofs are built from reusable lemmas that use an error-bound condition to get explicit closed-form thresholds for time-scale separation, covering strong to weak convexity without the coupled ratios you often see in discrete analyses. They do well on the comparison to the ideal hyper-gradient flow, with a finite-time tracking bound, and on checking the Euler discretization with those hypercleaning diagnostics to see if the regions hold up numerically. That adds some practical verification. The soft spot is the one flagged in the stress test. For the bilevel via lifted penalty, when the inner problem is only convex, the error bound might hold only for penalties that grow with the separation, which could mean you still end up with some dependence between parameters rather than fully modular choices. The paper claims the lemmas suffice across regimes, so if the conditions are stated without that scaling, it would address the concern directly. Otherwise it limits how unified the template really is. This is for optimization theorists working on continuous-time dynamics for nested problems. Someone who wants a reusable structure instead of custom Lyapunov functions for each setting would get the most out of it. It deserves a serious referee because the technical claims are specific enough to check against the lemmas and conditions. I recommend sending it to peer review, with attention to whether the penalty parameter remains independent in the convex bilevel case.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified Lyapunov template for the continuous-time analysis of single-loop gradient-flow dynamics applied to nested optimization problems. It covers minimax problems, bilevel optimization via a lifted penalty formulation, and min-min-max problems. The proof strategy is modular, relying on reusable lemmas together with an error-bound condition to characterize time-scale separation across regimes ranging from strong convexity/concavity to mere convexity. The template is asserted to deliver explicit closed-form thresholds that avoid the coupled ratio conditions typical of discrete-time analyses. Additional results include a comparison of the penalty dynamics against ideal hyper-gradient flow, a finite-time tracking bound, and a discussion of an Euler discretization with hypercleaning diagnostics.

Significance. If the central claims hold, the work would offer a meaningful contribution to the continuous-time analysis of bilevel and minimax problems by replacing problem-specific Lyapunov constructions with a reusable modular template. The explicit closed-form time-scale thresholds and the bridging of convexity regimes via error bounds could simplify future analyses and extend more readily across problem classes. The modular lemmas constitute a clear strength in reusability, and the comparison with hyper-gradient flow plus the discretization discussion add practical value.

major comments (2)

[§4.2] §4.2 (lifted penalty formulation for bilevel): the claim that the reusable lemmas and error-bound condition suffice to produce explicit closed-form time-scale thresholds without coupled ratios is load-bearing for the unification result. For merely convex inner problems the approximation error controlled by the penalty parameter λ typically requires λ to scale with the separation parameter (e.g., λ ≫ 1/ε) to keep the error-bound sufficiently tight; this dependence appears to reintroduce a form of problem-specific tuning between λ and the outer flow that the template is asserted to avoid.
[§3] §3 (unified Lyapunov template): the statement that the error-bound condition bridges all convexity regimes without requiring additional problem-specific assumptions is central to the modularity claim. It is not yet clear from the derivation whether the error-bound constant itself remains independent of the time-scale separation parameter when the inner problem is only convex, or whether hidden dependence enters through the choice of auxiliary dynamics.

minor comments (2)

[Abstract and §6] The term 'hypercleaning diagnostics' is used in the abstract and §6 without a concise definition or reference on first appearance; a short parenthetical explanation would improve readability for readers outside the immediate subfield.
[§3 and §4] Notation for the error-bound constant (denoted variously as μ or θ in different lemmas) should be unified across §3 and §4 to avoid confusion when the same symbol is reused for distinct quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about the independence of the error-bound condition and the avoidance of coupled tuning in the bilevel penalty case. We address each major comment below and have revised the manuscript to strengthen the exposition and clarify these aspects.

read point-by-point responses

Referee: [§4.2] §4.2 (lifted penalty formulation for bilevel): the claim that the reusable lemmas and error-bound condition suffice to produce explicit closed-form time-scale thresholds without coupled ratios is load-bearing for the unification result. For merely convex inner problems the approximation error controlled by the penalty parameter λ typically requires λ to scale with the separation parameter (e.g., λ ≫ 1/ε) to keep the error-bound sufficiently tight; this dependence appears to reintroduce a form of problem-specific tuning between λ and the outer flow that the template is asserted to avoid.

Authors: We thank the referee for this observation. In the lifted penalty formulation, λ is selected first to achieve a target approximation accuracy to the inner solution map (via the error-bound condition), after which the time-scale separation threshold is derived explicitly from the reusable lemmas. The lemmas are stated so that the resulting closed-form threshold depends on problem constants (including those arising from λ) but does not impose a coupled ratio between λ and the separation parameter ε. For merely convex inner problems the error bound remains valid once λ is fixed at a sufficient value; the separation condition is then chosen independently to dominate the remaining terms. We have added a clarifying remark and an explicit statement of the threshold in §4.2 to make this decoupling explicit. revision: yes
Referee: [§3] §3 (unified Lyapunov template): the statement that the error-bound condition bridges all convexity regimes without requiring additional problem-specific assumptions is central to the modularity claim. It is not yet clear from the derivation whether the error-bound constant itself remains independent of the time-scale separation parameter when the inner problem is only convex, or whether hidden dependence enters through the choice of auxiliary dynamics.

Authors: The error-bound condition is formulated as a property of the inner problem and the chosen auxiliary dynamics, with its constant depending only on problem data (Lipschitz constants, strong-convexity parameters when present, etc.) and not on the outer time-scale separation. In the modular lemmas of §3 the auxiliary dynamics are fixed independently of the separation parameter; the Lyapunov analysis then produces an explicit threshold that absorbs the error-bound term without introducing hidden dependence on the separation ratio. When the inner problem is merely convex the same structure holds, with the error-bound constant remaining independent of the separation. We have revised the derivation in §3 to state this independence explicitly and added a short discussion of the auxiliary-dynamics choice to preclude any ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: unified Lyapunov template rests on standard reusable lemmas and error bounds

full rationale

The paper proposes a modular proof built from reusable lemmas that characterize time-scale separation via an error-bound condition, producing explicit closed-form thresholds across convexity regimes for minimax, lifted-penalty bilevel, and min-min-max dynamics. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain; the central template is presented as independent of problem-specific Lyapunov constructions and avoids coupled ratios by design. The derivation is self-contained against external Lyapunov and error-bound ideas, with no quoted equations showing that a claimed prediction equals its input by definition or that uniqueness is imported solely from prior author work. This is the expected non-finding for a paper whose claims rest on standard analysis techniques rather than internal fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on standard mathematical assumptions from optimization and dynamical systems; no free parameters, ad-hoc axioms, or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Error-bound condition bridges strong convexity/concavity to mere convexity
Abstract states this condition enables the unified characterization of time-scale separation.

pith-pipeline@v0.9.0 · 5682 in / 1116 out tokens · 33611 ms · 2026-05-21T03:42:24.327377+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose... first unified Lyapunov template... W:= Φ(x) +αHy(·) +βHz(·)... deviation-to-residual conversion... time-scale absorption rule... error-bound condition
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

bridges regimes from strong convexity/concavity to mere convexity through an error-bound condition... explicit closed-form thresholds

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

[1]

Vivek S. Borkar. Stochastic approximation with two time scales.Systems & Control Letters, 1997

work page 1997
[2]

Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, 2021

work page 2021
[3]

A single-timescale method for stochastic bilevel optimization

Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 2466–2488. PMLR, 2022

work page 2022
[4]

Saddle-point dynamics: conditions for asymptotic stability of saddle points.SIAM Journal on Control and Optimization, 55(1): 486–511, 2017

Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points.SIAM Journal on Control and Optimization, 55(1): 486–511, 2017

work page 2017
[5]

An overview of bilevel optimization

Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007

work page 2007
[6]

Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems

Thinh Doan. Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems. InLearning for Dynamics and Control Conference, pages 192–206. PMLR, 2022

work page 2022
[7]

Generic methods for optimization-based modeling

Justin Domke. Generic methods for optimization-based modeling. InProceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012

work page 2012
[8]

Stability of primal–dual gradient dynamics and applications to network optimization.Automatica, 46(12):1974–1981, 2010

Diego Feijer and Fernando Paganini. Stability of primal–dual gradient dynamics and applications to network optimization.Automatica, 46(12):1974–1981, 2010. doi: 10.1016/j.automatica.2010. 08.011

work page doi:10.1016/j.automatica.2010 1974
[9]

Bilevel programming for hyperparameter optimization and meta-learning

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. InProceedings of the 35th International Conference on Machine Learning (ICML), 2018

work page 2018
[10]

Approximation Methods for Bilevel Programming

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming.arXiv preprint arXiv:1802.02246, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020
[12]

A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic.arXiv preprint arXiv:2007.05170, 2020

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic.arXiv preprint arXiv:2007.05170, 2020

work page arXiv 2007
[13]

A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM Journal on Optimization, 33(1):147–180, 2023

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM Journal on Optimization, 33(1):147–180, 2023

work page 2023
[14]

A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021

Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021. 10

work page 2021
[15]

Actor-critic algorithms.Advances in neural information processing systems, 12, 1999

Vijay Konda and John Tsitsiklis. Actor-critic algorithms.Advances in neural information processing systems, 12, 1999

work page 1999
[16]

A fully first-order method for stochastic bilevel optimization

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. InInternational Conference on Machine Learning, pages 18083–18113. PMLR, 2023

work page 2023
[17]

On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[18]

Error bounds, PL condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods

Feng-Yi Liao, Lijun Ding, and Yang Zheng. Error bounds, PL condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods. In Alessandro Abate, Mark Cannon, Kostas Margellos, and Antonis Papachristodoulou, editors,Proceedings of the 6th Annual Learning for Dynamics and Control Conference, volume 242 ofProcee...

work page 2024
[19]

Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization.Journal of Machine Learning Research, 26(11):1–45, 2025

Tianyi Lin, Chi Jin, and Michael I Jordan. Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization.Journal of Machine Learning Research, 26(11):1–45, 2025

work page 2025
[20]

Gradient-based hyperparameter opti- mization through reversible learning

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter opti- mization through reversible learning. InProceedings of the 32nd International Conference on Machine Learning (ICML), 2015

work page 2015
[21]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. InProceedings of the 33rd International Conference on Machine Learning (ICML), 2016

work page 2016
[22]

Truncated back- propagation for bilevel optimization

Amirreza Shaban, Ching-An Cheng, Nisha Hatch, and Byron Boots. Truncated back- propagation for bilevel optimization. InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019

work page 2019
[23]

Two-timescale stochastic gradient descent in continuous time with applications to joint online parameter estimation and optimal sensor placement

Louis Sharrock and Nikolas Kantas. Two-timescale stochastic gradient descent in continuous time with applications to joint online parameter estimation and optimal sensor placement. Bernoulli, 29(2):1137–1165, 2023

work page 2023
[24]

Theory of the market economy

Heinrich von Stackelberg et al. Theory of the market economy. 1952

work page 1952
[25]

A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights.Journal of Machine Learning Research, 17(153):1–43, 2016

Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights.Journal of Machine Learning Research, 17(153):1–43, 2016

work page 2016
[26]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018
[27]

Fednest: Federated bilevel, minimax, and compositional optimization

Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated bilevel, minimax, and compositional optimization. InInternational Conference on Machine Learning, pages 21146–21179. PMLR, 2022

work page 2022
[28]

inverse-Hessian times∇ yf

Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670–13682, 2021. 11 A Auxiliary Lemmas A.1 Instantiation Dictionary for the Unified Lemmas We list the concrete objects (Dy, Dz, ry, rz, τy, τz) in Lemmas 3.1–3.2 forP1–P3. In each case, Dy, Dz are the x-gr...

work page 2021
[29]

There exist constants µf , µg >0 such that for every x∈R dx, the mappings y7→f(x, y) andz7→g(x, z)areµ f – andµ g–strongly convex, respectively

work page
[30]

There exist constantsL f x, Lgx ≥0such that for allxand ally, y ′, z, z′, ∥∇xf(x, y)−∇ xf(x, y ′)∥ ≤L f x∥y−y ′∥,∥∇ xg(x, z)−∇ xg(x, z′)∥ ≤L gx∥z−z ′∥

work page
[31]

AssumeL inf := inf x L(x)>−∞

Letf ∗(x) := miny f(x, y),g ∗(x) := minz g(x, z), and define L(x) :=f ∗(x)−g ∗(x). AssumeL inf := inf x L(x)>−∞. Problem Setting.We consider min x∈Rdx min y∈Rdy max z∈Rdz E(x, y, z), E(x, y, z) :=f(x, y)−g(x, z).(83) For each fixedx, min y max z E(x, y, z) = min y f(x, y)−min z g(x, z) =f ∗(x)−g ∗(x) =L(x). We study the gradient flow associated with (83):...

work page

[1] [1]

Vivek S. Borkar. Stochastic approximation with two time scales.Systems & Control Letters, 1997

work page 1997

[2] [2]

Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, 2021

work page 2021

[3] [3]

A single-timescale method for stochastic bilevel optimization

Tianyi Chen, Yuejiao Sun, Quan Xiao, and Wotao Yin. A single-timescale method for stochastic bilevel optimization. InInternational Conference on Artificial Intelligence and Statistics, pages 2466–2488. PMLR, 2022

work page 2022

[4] [4]

Saddle-point dynamics: conditions for asymptotic stability of saddle points.SIAM Journal on Control and Optimization, 55(1): 486–511, 2017

Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points.SIAM Journal on Control and Optimization, 55(1): 486–511, 2017

work page 2017

[5] [5]

An overview of bilevel optimization

Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153(1):235–256, 2007

work page 2007

[6] [6]

Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems

Thinh Doan. Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems. InLearning for Dynamics and Control Conference, pages 192–206. PMLR, 2022

work page 2022

[7] [7]

Generic methods for optimization-based modeling

Justin Domke. Generic methods for optimization-based modeling. InProceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012

work page 2012

[8] [8]

Stability of primal–dual gradient dynamics and applications to network optimization.Automatica, 46(12):1974–1981, 2010

Diego Feijer and Fernando Paganini. Stability of primal–dual gradient dynamics and applications to network optimization.Automatica, 46(12):1974–1981, 2010. doi: 10.1016/j.automatica.2010. 08.011

work page doi:10.1016/j.automatica.2010 1974

[9] [9]

Bilevel programming for hyperparameter optimization and meta-learning

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. InProceedings of the 35th International Conference on Machine Learning (ICML), 2018

work page 2018

[10] [10]

Approximation Methods for Bilevel Programming

Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming.arXiv preprint arXiv:1802.02246, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020

[12] [12]

A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic.arXiv preprint arXiv:2007.05170, 2020

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic.arXiv preprint arXiv:2007.05170, 2020

work page arXiv 2007

[13] [13]

A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM Journal on Optimization, 33(1):147–180, 2023

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM Journal on Optimization, 33(1):147–180, 2023

work page 2023

[14] [14]

A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021

Prashant Khanduri, Siliang Zeng, Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A near-optimal algorithm for stochastic bilevel optimization via double-momentum.Advances in Neural Information Processing Systems, 34:30271–30283, 2021. 10

work page 2021

[15] [15]

Actor-critic algorithms.Advances in neural information processing systems, 12, 1999

Vijay Konda and John Tsitsiklis. Actor-critic algorithms.Advances in neural information processing systems, 12, 1999

work page 1999

[16] [16]

A fully first-order method for stochastic bilevel optimization

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. InInternational Conference on Machine Learning, pages 18083–18113. PMLR, 2023

work page 2023

[17] [17]

On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[18] [18]

Error bounds, PL condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods

Feng-Yi Liao, Lijun Ding, and Yang Zheng. Error bounds, PL condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods. In Alessandro Abate, Mark Cannon, Kostas Margellos, and Antonis Papachristodoulou, editors,Proceedings of the 6th Annual Learning for Dynamics and Control Conference, volume 242 ofProcee...

work page 2024

[19] [19]

Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization.Journal of Machine Learning Research, 26(11):1–45, 2025

Tianyi Lin, Chi Jin, and Michael I Jordan. Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization.Journal of Machine Learning Research, 26(11):1–45, 2025

work page 2025

[20] [20]

Gradient-based hyperparameter opti- mization through reversible learning

Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter opti- mization through reversible learning. InProceedings of the 32nd International Conference on Machine Learning (ICML), 2015

work page 2015

[21] [21]

Hyperparameter optimization with approximate gradient

Fabian Pedregosa. Hyperparameter optimization with approximate gradient. InProceedings of the 33rd International Conference on Machine Learning (ICML), 2016

work page 2016

[22] [22]

Truncated back- propagation for bilevel optimization

Amirreza Shaban, Ching-An Cheng, Nisha Hatch, and Byron Boots. Truncated back- propagation for bilevel optimization. InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019

work page 2019

[23] [23]

Two-timescale stochastic gradient descent in continuous time with applications to joint online parameter estimation and optimal sensor placement

Louis Sharrock and Nikolas Kantas. Two-timescale stochastic gradient descent in continuous time with applications to joint online parameter estimation and optimal sensor placement. Bernoulli, 29(2):1137–1165, 2023

work page 2023

[24] [24]

Theory of the market economy

Heinrich von Stackelberg et al. Theory of the market economy. 1952

work page 1952

[25] [25]

A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights.Journal of Machine Learning Research, 17(153):1–43, 2016

Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights.Journal of Machine Learning Research, 17(153):1–43, 2016

work page 2016

[26] [26]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018

[27] [27]

Fednest: Federated bilevel, minimax, and compositional optimization

Davoud Ataee Tarzanagh, Mingchen Li, Christos Thrampoulidis, and Samet Oymak. Fednest: Federated bilevel, minimax, and compositional optimization. InInternational Conference on Machine Learning, pages 21146–21179. PMLR, 2022

work page 2022

[28] [28]

inverse-Hessian times∇ yf

Junjie Yang, Kaiyi Ji, and Yingbin Liang. Provably faster algorithms for bilevel optimization. Advances in Neural Information Processing Systems, 34:13670–13682, 2021. 11 A Auxiliary Lemmas A.1 Instantiation Dictionary for the Unified Lemmas We list the concrete objects (Dy, Dz, ry, rz, τy, τz) in Lemmas 3.1–3.2 forP1–P3. In each case, Dy, Dz are the x-gr...

work page 2021

[29] [29]

There exist constants µf , µg >0 such that for every x∈R dx, the mappings y7→f(x, y) andz7→g(x, z)areµ f – andµ g–strongly convex, respectively

work page

[30] [30]

There exist constantsL f x, Lgx ≥0such that for allxand ally, y ′, z, z′, ∥∇xf(x, y)−∇ xf(x, y ′)∥ ≤L f x∥y−y ′∥,∥∇ xg(x, z)−∇ xg(x, z′)∥ ≤L gx∥z−z ′∥

work page

[31] [31]

AssumeL inf := inf x L(x)>−∞

Letf ∗(x) := miny f(x, y),g ∗(x) := minz g(x, z), and define L(x) :=f ∗(x)−g ∗(x). AssumeL inf := inf x L(x)>−∞. Problem Setting.We consider min x∈Rdx min y∈Rdy max z∈Rdz E(x, y, z), E(x, y, z) :=f(x, y)−g(x, z).(83) For each fixedx, min y max z E(x, y, z) = min y f(x, y)−min z g(x, z) =f ∗(x)−g ∗(x) =L(x). We study the gradient flow associated with (83):...

work page