Glocal Smoothness: Line search and adaptive step sizes can help in theory too!

Aaron Mishkin; Curtis Fox; Mark Schmidt; Sharan Vaswani

arxiv: 2506.12648 · v2 · pith:R7GKDYAYnew · submitted 2025-06-14 · 🧮 math.OC · cs.LG

Glocal Smoothness: Line search and adaptive step sizes can help in theory too!

Curtis Fox , Aaron Mishkin , Sharan Vaswani , Mark Schmidt This is my paper

Pith reviewed 2026-05-22 00:51 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords glocal smoothnessline searchadaptive step sizeiteration complexitygradient descentfirst-order methodssmooth optimization

0 comments

The pith

Glocal smoothness lets line-search gradient descent achieve better iteration complexity than fixed-step accelerated methods in some settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces glocal smoothness, a property of the objective function that captures both its global and local smoothness behavior using only function values rather than any algorithm iterates. This iterate-independent assumption makes it possible to prove concrete upper bounds on the number of iterations needed by different first-order methods. Under glocal smoothness, simple line searches are shown to improve rates over any fixed step size, and in certain cases plain gradient descent with a line search beats accelerated gradient descent that uses a fixed step size. The same property yields improved guarantees for Polyak steps, AdGD, stochastic gradients, coordinate descent, and nonlinear conjugate gradient methods.

Core claim

Under the glocal smoothness assumption, which depends solely on properties of the objective function, line searches and adaptive step sizes yield upper bounds on iteration complexities in terms of iterate-independent constants. This framework demonstrates advantages of line searches over fixed step sizes and shows that gradient descent with line search can have better iteration complexity than accelerated methods with fixed step sizes in certain settings. Similar improvements hold for Polyak and AdGD step sizes as well as other first-order methods.

What carries the argument

The glocal smoothness property, a function-only characterization of global and local gradient Lipschitz constants that enables complexity bounds independent of algorithm iterates.

If this is right

Line searches produce better iteration complexities than fixed step sizes for gradient descent.
In some settings gradient descent with line search has better iteration complexity than accelerated methods with fixed step sizes.
Polyak and AdGD adaptive steps also obtain improved iteration bounds.
Coordinate optimization, stochastic gradient methods, accelerated gradient methods, and nonlinear conjugate gradient methods all inherit better rates under the same assumption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The assumption could let practitioners estimate local smoothness constants directly from function evaluations without tracking iterates.
Similar function-only smoothness notions might simplify analysis of other adaptive schemes such as trust-region or quasi-Newton methods.
The results suggest that many existing local-Lipschitz proofs could be restated in iterate-independent terms for easier comparison.
Numerical checks on standard test functions could verify whether the predicted ordering of iteration counts appears in practice.

Load-bearing premise

The objective function satisfies the glocal smoothness property defined solely in terms of function properties rather than algorithm iterates.

What would settle it

A concrete quadratic or piecewise-quadratic function that satisfies glocal smoothness, together with explicit iteration counts showing whether gradient descent plus line search uses fewer steps than Nesterov acceleration with its theoretically optimal fixed step size.

Figures

Figures reproduced from arXiv: 2506.12648 by Aaron Mishkin, Curtis Fox, Mark Schmidt, Sharan Vaswani.

**Figure 2.** Figure 2: Illustration of proposition 7 along the slice [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Glocal smoothness gives a function-only assumption that yields iterate-independent bounds, letting them show line-search GD can beat fixed-step accelerated methods in some cases.

read the letter

The main point is that this paper defines glocal smoothness as a property of the function itself rather than the algorithm path. That single change lets them write iteration complexity bounds that stay independent of the iterates, which makes direct comparisons between methods possible without the usual headaches from prior local Lipschitz assumptions. They use it to show that line searches improve over fixed steps in a clean way, and that in some regimes plain gradient descent with line search has better complexity than accelerated gradient descent with fixed steps. They also carry the idea over to Polyak steps, AdGD, coordinate methods, stochastic gradients, accelerated variants, and nonlinear conjugate gradient. That breadth is useful. The assumption is simple enough that the advantages become straightforward to derive once you adopt it. The soft spot is that glocal smoothness still has to hold across the whole domain in a specific way, and it is not obvious how often this is true for the non-convex losses common in machine learning. The abstract calls the proofs straightforward, but the quantitative size of the improvement will matter; if the constants end up loose, the theoretical edge may not change much in practice. This paper is for researchers who analyze first-order methods and want better tools to justify adaptive and line-search heuristics without iterate-dependent rates. A reader working on convergence theory for training algorithms would get something concrete from the new assumption and the comparisons. The central argument looks solid with no evident circularity or hidden dependence. It deserves a serious referee because it fills a real gap in how we handle local smoothness for algorithm comparisons. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a 'glocal smoothness' assumption on the objective function that encodes both global and local gradient Lipschitz behavior using only function properties, independent of any algorithm iterates. This assumption is used to derive iterate-independent iteration complexity upper bounds for first-order methods. The central claims are that line-search gradient descent enjoys advantages over fixed-step gradient descent, and that in some regimes its complexity is better than that of accelerated gradient descent with fixed steps; the framework is also applied to Polyak and AdGD step sizes as well as coordinate, stochastic, accelerated, and conjugate-gradient methods.

Significance. If the glocal smoothness property yields the stated iterate-independent bounds and the complexity comparisons are valid, the work supplies a useful theoretical device for justifying adaptive and line-search methods under a single function-only assumption. The ability to compare complexities directly across algorithms without iterate dependence is a clear strength and could help reconcile practical performance with worst-case theory.

major comments (2)

[§4] §4 (main complexity comparison): the claim that gradient descent with line search has better iteration complexity than accelerated gradient descent with fixed steps in 'some settings' requires the explicit big-O expressions in terms of the glocal parameters (global and local Lipschitz constants) to be stated side-by-side so that the regime where the inequality holds can be verified.
[Definition of glocal smoothness] Definition of glocal smoothness (likely §2): confirm that the definition is stated purely in terms of function values or gradient norms without any reference to a sequence of iterates, because the iterate-independence of the resulting bounds is load-bearing for all subsequent comparisons.

minor comments (2)

[Abstract] Abstract: the phrase 'in some settings' should be replaced by a brief, quantitative description of the regime (e.g., when the local Lipschitz constant is sufficiently smaller than the global one).
Consider adding a short numerical example or low-dimensional function that satisfies glocal smoothness with a large gap between local and global constants; this would make the practical relevance of the assumption concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and recommendation for minor revision. The comments help clarify the presentation of the glocal smoothness framework and its complexity comparisons. We address each major comment below and have incorporated revisions to make the key claims more explicit and verifiable.

read point-by-point responses

Referee: [§4] §4 (main complexity comparison): the claim that gradient descent with line search has better iteration complexity than accelerated gradient descent with fixed steps in 'some settings' requires the explicit big-O expressions in terms of the glocal parameters (global and local Lipschitz constants) to be stated side-by-side so that the regime where the inequality holds can be verified.

Authors: We agree that an explicit side-by-side comparison strengthens the claim. In the original manuscript, the iteration complexities are derived separately under the glocal smoothness assumption: GD with line search achieves a bound depending on the local Lipschitz parameter (often leading to faster rates when local smoothness is favorable), while fixed-step AGD depends on the global Lipschitz constant. To address this, we have added a new paragraph and table in the revised §4 that states the big-O expressions side-by-side in terms of the global constant L and local constant l (or equivalent glocal parameters), explicitly identifying the regimes (e.g., when l << L) where GD+line search has superior iteration complexity. This makes the 'some settings' claim directly verifiable without altering the underlying proofs. revision: yes
Referee: [Definition of glocal smoothness] Definition of glocal smoothness (likely §2): confirm that the definition is stated purely in terms of function values or gradient norms without any reference to a sequence of iterates, because the iterate-independence of the resulting bounds is load-bearing for all subsequent comparisons.

Authors: We confirm that the glocal smoothness definition in §2 is formulated exclusively in terms of the objective function f and its gradient properties (specifically, a global Lipschitz constant L together with a local smoothness parameter that depends only on function values or gradient norms at arbitrary points x). The definition contains no reference to any algorithm-generated sequence of iterates, which directly enables the iterate-independent complexity bounds used throughout the paper for all comparisons. To further emphasize this point for readers, we have added a short clarifying remark immediately after the definition. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from new assumption

full rationale

The paper defines glocal smoothness as a function-only property independent of iterates, then derives iteration complexity bounds and comparisons (e.g., line-search GD vs. fixed-step accelerated methods) directly from this assumption. No quoted steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the new assumption supplies independent content for the bounds rather than the bounds defining or presupposing the assumption. The derivation chain remains non-circular and externally falsifiable via the stated function properties.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central results rest on the newly introduced glocal smoothness assumption; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption The objective function satisfies glocal smoothness, a characterization of global and local smoothness that depends only on properties of the function.
This assumption is invoked to obtain iterate-independent iteration complexity bounds that enable comparisons across algorithms.

pith-pipeline@v0.9.0 · 5761 in / 1158 out tokens · 64189 ms · 2026-05-22T00:51:56.635389+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A function f with minimum f* is glocally (L, L*, δ)-smooth if f is globally L-smooth, and locally L*-smooth for all u such that f(u)−f*≤δ
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 6 … T ≥ L/μ log(Δ0/δ) + L*/μ log(δ/ε)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1]

Multinomial logistic regression algorithm.Ann

(cited on 6) Dankmar Böhning. Multinomial logistic regression algorithm.Ann. Inst. Stat. Math., 44(1):197–200, 1992. (cited on 4) Jérôme Bolte and Edouard Pauwels. Curiosities and counterexamples in smooth convex optimization.Math. Program., 195(1):553–603, 2022. (cited on 9) Stephen P. Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge, 2004. (c...

work page arXiv 1992
[2]

Curtis and Daniel P

(cited on 15) Frank E. Curtis and Daniel P. Robinson. Regional complexity analysis of algorithms for nonconvex smooth optimization. Math. Program., 187(1):579–615, 2021. (cited on 4) Shuvomoy Das Gupta, Robert M Freund, Xu Andy Sun, and Adrien Taylor. Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses.Mathema...

work page arXiv 2021
[3]

Nesterov

(cited on 14) Yurii E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM J. Optim., 22(2):341–362, 2012. (cited on 10, 15, 28) Yurii E. Nesterov. Gradient methods for minimizing composite functions.Math. Program., 140(1):125–161, 2013b. (cited on 3, 14) Yurii E. Nesterov and Boris T. Polyak. Cubic regularization o...

work page arXiv 2012
[4]

If f is G-Lipschitz, then f is (L0 + L1G, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth

work page
[5]

If f is L-smooth, then f is (L, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth. Proof. Recall that a function isG-Lipschitz if∥∇f(x)∥ ≤ G for allx. If f also satisfies(L0, L1) non-uniform smoothness, then, ∥∇f(x) − ∇f(y)∥ ≤ (L0 + L1 G) ∥y − x∥ Hence, f is globally(L0 + L1G)-smooth. If insteadf is L-smooth, then by definition it is globallyL-smooth. We now proce...

work page 2023
[7]

Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ

The Polyak step does not guarantee decrease in the function value, but we remain in the region of interest once∥wt −w∗∥ becomes sufficiently small. Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ. By global smoothness we have f(w) − f∗ ≤ L 2 ∥w − w∗∥2, and thus, f(wtδ) − f∗ ≤ δ. Further by the monotonicity of∥wt − w∗∥, we havef(wt) − f∗ ≤ δ for all t ≥ tδ

work page
[8]

Initialized at w0, using lemma 25 with¯L = L, GD(Polyak) is guaranteed to haveL 2 ∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log L 2 ∥w0−w∗∥2 δ ⌉ iterations

work page
[9]

Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

By L∗ smoothness within the region of interest we havef(wt) − f∗ ≤ ϵ if we have L∗ 2 ∥wt − w∗∥2 ≤ ϵ for t ≥ tδ. Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page
[10]

Since L 2 ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log L∗ L δ ϵ ⌉ additional iterations. D.2 Example of Polyak Step Size Increasing Function Value In this section, we give an example showing that gradient descent with the Polyak step size can increase the function value while decreasing the2-norm distance to the minimizer. Consider the functio...

work page
[12]

It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

Let tδ be the first iteration such that∥wtδ − w∗∥2 ≤ δ. It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

work page
[13]

Initialized atw0, using Lemma 25 with¯L = L, GD(Polyak) is guaranteed to have∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log ∥w0−w∗∥2 δ ⌉ iterations

work page
[14]

Initialized at wtδ, using Lemma 25 with ¯L = L∗, we have that ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page
[15]

E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]

Since ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log δ ϵ ⌉ additional iterations. E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]. Lemma 27.Assume thatf is ¯L-smooth andµ-strongly convex. For allt > 2 the iterates of ...

work page 2020
[17]

Now let tδ = t ′ + 2

Let t ′ be the first iteration such thatΦt′ ≤ δ and t ′ > 2. Now let tδ = t ′ + 2. Since ∥wt − w∗∥2 ≤ Φt for all t, it follows from the monotonicity ofΦt that all t ≥ tδ − 1 stay in the region of interest

work page
[18]

Initialized at w0, using Lemma 27 with¯L = L, AdGD is guaranteed to have∥wtδ − w∗∥2 ≤ Φtδ ≤ δ after tδ = O L µ log Φ3 δ iterations

work page
[19]

Initialized at wtδ, using Lemma 27 with ¯L = L∗, we have that∥wt − w∗∥2 ≤ Φt ≤ ϵ after another t = O L∗ µ log Φtδ ϵ iterations

work page
[20]

26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions

Since Φtδ ≤ δ, it is sufficient to havet = O L∗ µ log δ ϵ additional iterations. 26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions. We give such a result next, following an argument similar to previous work [Beck and Tetruashvili, 2013, Section 3]. Lemma 28.Assume thatf is ¯L-s...

work page 2013
[21]

Initialized at w0, using Lemma 28 with¯L = L, GD(LO) is guaranteed to havef(wt) − f∗ ≤ δ after tδ = O L δ R2(f(w0)) iterations. F.2 Globally Convex and Locally Convex Proof of Theorem 13.We only need to specify Steps 4 and 5 as the other steps are identical to the analysis of GD(LO) in the globally convex and locally strongly-convex case:

work page
[22]

Initialized atwtδ, using Corollary 8 Lemma 28 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after another t = O L∗ ϵ R2(f(wtδ)) iterations

work page
[23]

G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17

Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = O L∗ ϵ R2(δ + f∗) additional iterations. G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17. We first review a standard result regarding the convergence rate of random and greedy coordinate descent [Nesterov, 2012, Nutini et al., 2015]. Lemma 29. Assume that ...

work page 2012
[25]

Let ρ = 1 − µ dL where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (f(wt) − f∗ > λ(f(w0) − f∗)) ≤ E[f(wt)] − f∗ λ(f(w0) − f∗) ≤ ρt λ . (Using the convergence guarantee) Choosing λ = δ f(w0)−f∗ we obtain P (f(wt) − f∗ > δ) ≤ ρt[f(w0) − f∗] δ . ...

work page 2017
[26]

Initialized at w0, using Lemma 29 with ¯L = L, random coordinate descent satisfiesf(wt) − f∗ ≤ δ after tδ = ⌈ dL µ log f(w0)−f∗ δζ ⌉ iterations with probability1 − ζ

work page
[27]

Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

Define Xδ as the event thatf(wtδ) − f∗ ≤ δ. Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

work page
[28]

Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations

Conditioned on Xδ, f(wtδ) − f∗ ≤ δ. Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations. Proof of Theorem 17.We follow the analysis structure outlined in Section 3.6:

work page
[29]

Our region of interest under coordinate-wise glocal smoothness is{w|f(w) − f∗ ≤ δ}

work page
[30]

It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

Let tδ be the first iteration such thatf(wtδ) − f∗ ≤ δ. It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

work page
[31]

Initialized at w0, using Lemma 29 with¯L = L, coordinate descent with greedy coordinate selection is guaranteed to havef(wt) − f∗ ≤ δ after tδ = ⌈ L µ1 log f(w0)−f∗ δ ⌉ iterations

work page
[32]

Initialized at wtδ, using Corollary 30 and Lemma 29 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after an additional t = ⌈ L∗ µ1 log f(wtδ )−f∗ ϵ ⌉ iterations

work page
[33]

Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = ⌈ L∗ µ1 log δ ϵ ⌉ additional iterations. 29 H Stochastic Gradient Descent Convergence Results The following result is implied by existing work [Mishkin, 2020, Theorem 9] on using a stochastic line search under the interpolation assumption. Earlier work [Vaswani et al., 2019, Theorem 1] also gives a s...

work page 2020
[34]

Our region of interest under glocal smoothness of the iterates is{w|∥w − w∗∥2 ≤ δ}

work page
[35]

Let ρ = 1 − µ 2Lmax where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (∥wt − w∗∥2 > λ∥w0 − w∗∥2) ≤ E[∥wt − w∗∥2] λ∥w0 − w∗∥2 ≤ ρt λ . Choosing λ = δ ∥w0−w∗∥2 we obtain P (∥wt − w∗∥2 > δ) ≤ ρt∥w0 − w∗∥2 δ . For a probabilityζ such that 0 < ζ < ...

work page 2017
[36]

Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

Initialized at w0, using Lemma 31 with¯L = Lmax, stochastic gradient descent satisfies∥wt − w∗∥2 ≤ δ after tδ = l 2Lmax µ log ∥w0−w∗∥2 δζ m iterations with probability1 − ζ. Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

work page
[37]

Define Xδ as the event that∥wtδ − w∗∥ ≤ δ. Initialized at wtδ, using Lemma 31 with ¯L = Lmax,∗, and noting that we assume ηmax ≥ 1 Lmax,∗ , we have that E[∥wT − w∗∥ | Xδ] ≤ ϵ after another t = ⌈ 2Lmax,∗ µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page
[38]

T −1Y t=0 (1 − √ηt · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 , f(zT ) − f(w∗) ≤ ¯L µ

Conditioned on Xδ, ∥wtδ − w∗∥2 ≤ δ. Hence, it is sufficient to havet = ⌈ 2Lmax,∗ µ log δ ϵ ⌉ additional iterations. 30 I Acceleration Convergence Results Lemma 32.If f is µ-strongly convex, then the maximum step-size satisfying the Armijo condition in eq. (14) is bounded as, sup n η > 0 : f(wt − η∇f(wt)) ≤ f(wt) − η 2 ∥∇f(wt)∥2 o ≤ 1 µ . Proof. Since f is...

work page 2021
[39]

Our region of interest under glocal smoothness is{w|f(w) − f∗ ≤ δ}

work page
[40]

Let tδ be the firstt for which ht ≤ δ

By proposition 34, we know that max {f(wt), f(yt)} ≤ L µ "t−1Y i=0 (1 − √ηi · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 =: ht. Let tδ be the firstt for which ht ≤ δ. Since ht is monotone decreasing int, ht ≤ htδ ≤ δ for all t ≥ tδ. Thus, f(wt) − f∗, f(yt) − f∗ ≤ δ for all t ≥ tδ

work page
[41]

(14) satisfies ηt ≥ 2/L for all t ∈ N

Since f is globally L-smooth, the step-size selected by backtracking line-search on eq. (14) satisfies ηt ≥ 2/L for all t ∈ N. As a result, the sequenceht is controlled as, ht ≤ L µ 1 − r µ 2L t f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 . We conclude that an upper-bound on the iteration complexity forwt, yt to enter and remain in the δ + f∗ sub-level set is given ...

work page
[42]

It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set

We have established thatf(yt) − f∗, f(wt) − f∗ ≤ δ for all t ≥ tδ. It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set. In particular, we shall showηt ≥ 1 2L∗ for all t ≥ tδ. proposition 7 implies that for allt ≥ tδ and η ≤ 1/L∗, ˜wt+1(η) = yt − η∇f(yt), satisfies f( ...

work page
[43]

33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21

Combining the two complexities implies we require at most T ≤ s 2L µ log L f(w0) − f∗ + µ 2 ∥w0 − w∗∥2 µ · δ ! + s 2L∗ µ log µ · δ L · ϵ , iterations to compute anϵ sub-optimal point. 33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21. We rely on the following result regarding the performance of the PRP ...

work page 2024
[44]

Our region of interest under glocal smoothness and is{w | f(w) − f∗ ≤ δ}

work page
[45]

From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

Let tδ be any iteration wheref(wtδ) − f∗ ≤ δ. From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

work page
[46]

From Lemma 35, PRP NLCG initialized atw0 is guaranteed to satisfy f(wt) − f∗ ≤ δ after tδ = 1 2 L µ 2 log f(w0)−f∗ δ iterations. 34 Within the local region, there are 3 different ways we could reach an accuracy ofϵ: (a) we might reach an accuracy of ϵ before the first reset occurs due to the global rate from Lemma 35, (b) we might reach an accuracy of ϵ a...

work page
[47]

Initialized at wtδ, from Lemma 35 we require an additionalt = 1 2 L µ 2 log f(wtδ )−f∗ ϵ iterations to have f(wt) − f∗ ≤ ϵ

work page
[48]

We now turn to case (b):

Using that f(wtδ) ≤ δ, it is sufficient to havet = 1 2 L µ 2 log δ ϵ iterations. We now turn to case (b):

work page
[49]

Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG

In order to exploit the local quadratic property, we require at most(d −1) additional iterations beyond tδ. Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG. Using Lemma 36 with¯L = L∗, PRP NLCG is guaranteed to satisfy f(wt) − f∗ ≤ ϵ after another t = 1 2 q L∗ µ log 4 L...

work page
[50]

Using that f(wtδ) − f∗ ≤ δ and Lemma 35, we have f(wtδ+(d−1)) − f∗ ≤ 1−(µ/L)2 1+(µ/L)2 2(d−1) δ ≤ exp(−2(d−1)µ2/L2)δ. , wecanconcludethatitissufficienttohave t = 1 2 q L∗ µ log 4 L∗ µ 2 δ ϵ − (d − 1) q L∗ µ µ2 L2 additional iterations (the negative term reduces the iteration count based on the progress made before the reset, which we have omitted in the f...

work page 2015

[1] [1]

Multinomial logistic regression algorithm.Ann

(cited on 6) Dankmar Böhning. Multinomial logistic regression algorithm.Ann. Inst. Stat. Math., 44(1):197–200, 1992. (cited on 4) Jérôme Bolte and Edouard Pauwels. Curiosities and counterexamples in smooth convex optimization.Math. Program., 195(1):553–603, 2022. (cited on 9) Stephen P. Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge, 2004. (c...

work page arXiv 1992

[2] [2]

Curtis and Daniel P

(cited on 15) Frank E. Curtis and Daniel P. Robinson. Regional complexity analysis of algorithms for nonconvex smooth optimization. Math. Program., 187(1):579–615, 2021. (cited on 4) Shuvomoy Das Gupta, Robert M Freund, Xu Andy Sun, and Adrien Taylor. Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses.Mathema...

work page arXiv 2021

[3] [3]

Nesterov

(cited on 14) Yurii E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM J. Optim., 22(2):341–362, 2012. (cited on 10, 15, 28) Yurii E. Nesterov. Gradient methods for minimizing composite functions.Math. Program., 140(1):125–161, 2013b. (cited on 3, 14) Yurii E. Nesterov and Boris T. Polyak. Cubic regularization o...

work page arXiv 2012

[4] [4]

If f is G-Lipschitz, then f is (L0 + L1G, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth

work page

[5] [5]

If f is L-smooth, then f is (L, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth. Proof. Recall that a function isG-Lipschitz if∥∇f(x)∥ ≤ G for allx. If f also satisfies(L0, L1) non-uniform smoothness, then, ∥∇f(x) − ∇f(y)∥ ≤ (L0 + L1 G) ∥y − x∥ Hence, f is globally(L0 + L1G)-smooth. If insteadf is L-smooth, then by definition it is globallyL-smooth. We now proce...

work page 2023

[6] [7]

Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ

The Polyak step does not guarantee decrease in the function value, but we remain in the region of interest once∥wt −w∗∥ becomes sufficiently small. Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ. By global smoothness we have f(w) − f∗ ≤ L 2 ∥w − w∗∥2, and thus, f(wtδ) − f∗ ≤ δ. Further by the monotonicity of∥wt − w∗∥, we havef(wt) − f∗ ≤ δ for all t ≥ tδ

work page

[7] [8]

Initialized at w0, using lemma 25 with¯L = L, GD(Polyak) is guaranteed to haveL 2 ∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log L 2 ∥w0−w∗∥2 δ ⌉ iterations

work page

[8] [9]

Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

By L∗ smoothness within the region of interest we havef(wt) − f∗ ≤ ϵ if we have L∗ 2 ∥wt − w∗∥2 ≤ ϵ for t ≥ tδ. Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page

[9] [10]

Since L 2 ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log L∗ L δ ϵ ⌉ additional iterations. D.2 Example of Polyak Step Size Increasing Function Value In this section, we give an example showing that gradient descent with the Polyak step size can increase the function value while decreasing the2-norm distance to the minimizer. Consider the functio...

work page

[10] [12]

It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

Let tδ be the first iteration such that∥wtδ − w∗∥2 ≤ δ. It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

work page

[11] [13]

Initialized atw0, using Lemma 25 with¯L = L, GD(Polyak) is guaranteed to have∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log ∥w0−w∗∥2 δ ⌉ iterations

work page

[12] [14]

Initialized at wtδ, using Lemma 25 with ¯L = L∗, we have that ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page

[13] [15]

E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]

Since ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log δ ϵ ⌉ additional iterations. E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]. Lemma 27.Assume thatf is ¯L-smooth andµ-strongly convex. For allt > 2 the iterates of ...

work page 2020

[14] [17]

Now let tδ = t ′ + 2

Let t ′ be the first iteration such thatΦt′ ≤ δ and t ′ > 2. Now let tδ = t ′ + 2. Since ∥wt − w∗∥2 ≤ Φt for all t, it follows from the monotonicity ofΦt that all t ≥ tδ − 1 stay in the region of interest

work page

[15] [18]

Initialized at w0, using Lemma 27 with¯L = L, AdGD is guaranteed to have∥wtδ − w∗∥2 ≤ Φtδ ≤ δ after tδ = O L µ log Φ3 δ iterations

work page

[16] [19]

Initialized at wtδ, using Lemma 27 with ¯L = L∗, we have that∥wt − w∗∥2 ≤ Φt ≤ ϵ after another t = O L∗ µ log Φtδ ϵ iterations

work page

[17] [20]

26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions

Since Φtδ ≤ δ, it is sufficient to havet = O L∗ µ log δ ϵ additional iterations. 26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions. We give such a result next, following an argument similar to previous work [Beck and Tetruashvili, 2013, Section 3]. Lemma 28.Assume thatf is ¯L-s...

work page 2013

[18] [21]

Initialized at w0, using Lemma 28 with¯L = L, GD(LO) is guaranteed to havef(wt) − f∗ ≤ δ after tδ = O L δ R2(f(w0)) iterations. F.2 Globally Convex and Locally Convex Proof of Theorem 13.We only need to specify Steps 4 and 5 as the other steps are identical to the analysis of GD(LO) in the globally convex and locally strongly-convex case:

work page

[19] [22]

Initialized atwtδ, using Corollary 8 Lemma 28 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after another t = O L∗ ϵ R2(f(wtδ)) iterations

work page

[20] [23]

G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17

Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = O L∗ ϵ R2(δ + f∗) additional iterations. G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17. We first review a standard result regarding the convergence rate of random and greedy coordinate descent [Nesterov, 2012, Nutini et al., 2015]. Lemma 29. Assume that ...

work page 2012

[21] [25]

Let ρ = 1 − µ dL where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (f(wt) − f∗ > λ(f(w0) − f∗)) ≤ E[f(wt)] − f∗ λ(f(w0) − f∗) ≤ ρt λ . (Using the convergence guarantee) Choosing λ = δ f(w0)−f∗ we obtain P (f(wt) − f∗ > δ) ≤ ρt[f(w0) − f∗] δ . ...

work page 2017

[22] [26]

Initialized at w0, using Lemma 29 with ¯L = L, random coordinate descent satisfiesf(wt) − f∗ ≤ δ after tδ = ⌈ dL µ log f(w0)−f∗ δζ ⌉ iterations with probability1 − ζ

work page

[23] [27]

Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

Define Xδ as the event thatf(wtδ) − f∗ ≤ δ. Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

work page

[24] [28]

Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations

Conditioned on Xδ, f(wtδ) − f∗ ≤ δ. Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations. Proof of Theorem 17.We follow the analysis structure outlined in Section 3.6:

work page

[25] [29]

Our region of interest under coordinate-wise glocal smoothness is{w|f(w) − f∗ ≤ δ}

work page

[26] [30]

It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

Let tδ be the first iteration such thatf(wtδ) − f∗ ≤ δ. It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

work page

[27] [31]

Initialized at w0, using Lemma 29 with¯L = L, coordinate descent with greedy coordinate selection is guaranteed to havef(wt) − f∗ ≤ δ after tδ = ⌈ L µ1 log f(w0)−f∗ δ ⌉ iterations

work page

[28] [32]

Initialized at wtδ, using Corollary 30 and Lemma 29 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after an additional t = ⌈ L∗ µ1 log f(wtδ )−f∗ ϵ ⌉ iterations

work page

[29] [33]

Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = ⌈ L∗ µ1 log δ ϵ ⌉ additional iterations. 29 H Stochastic Gradient Descent Convergence Results The following result is implied by existing work [Mishkin, 2020, Theorem 9] on using a stochastic line search under the interpolation assumption. Earlier work [Vaswani et al., 2019, Theorem 1] also gives a s...

work page 2020

[30] [34]

Our region of interest under glocal smoothness of the iterates is{w|∥w − w∗∥2 ≤ δ}

work page

[31] [35]

Let ρ = 1 − µ 2Lmax where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (∥wt − w∗∥2 > λ∥w0 − w∗∥2) ≤ E[∥wt − w∗∥2] λ∥w0 − w∗∥2 ≤ ρt λ . Choosing λ = δ ∥w0−w∗∥2 we obtain P (∥wt − w∗∥2 > δ) ≤ ρt∥w0 − w∗∥2 δ . For a probabilityζ such that 0 < ζ < ...

work page 2017

[32] [36]

Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

Initialized at w0, using Lemma 31 with¯L = Lmax, stochastic gradient descent satisfies∥wt − w∗∥2 ≤ δ after tδ = l 2Lmax µ log ∥w0−w∗∥2 δζ m iterations with probability1 − ζ. Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

work page

[33] [37]

Define Xδ as the event that∥wtδ − w∗∥ ≤ δ. Initialized at wtδ, using Lemma 31 with ¯L = Lmax,∗, and noting that we assume ηmax ≥ 1 Lmax,∗ , we have that E[∥wT − w∗∥ | Xδ] ≤ ϵ after another t = ⌈ 2Lmax,∗ µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

work page

[34] [38]

T −1Y t=0 (1 − √ηt · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 , f(zT ) − f(w∗) ≤ ¯L µ

Conditioned on Xδ, ∥wtδ − w∗∥2 ≤ δ. Hence, it is sufficient to havet = ⌈ 2Lmax,∗ µ log δ ϵ ⌉ additional iterations. 30 I Acceleration Convergence Results Lemma 32.If f is µ-strongly convex, then the maximum step-size satisfying the Armijo condition in eq. (14) is bounded as, sup n η > 0 : f(wt − η∇f(wt)) ≤ f(wt) − η 2 ∥∇f(wt)∥2 o ≤ 1 µ . Proof. Since f is...

work page 2021

[35] [39]

Our region of interest under glocal smoothness is{w|f(w) − f∗ ≤ δ}

work page

[36] [40]

Let tδ be the firstt for which ht ≤ δ

By proposition 34, we know that max {f(wt), f(yt)} ≤ L µ "t−1Y i=0 (1 − √ηi · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 =: ht. Let tδ be the firstt for which ht ≤ δ. Since ht is monotone decreasing int, ht ≤ htδ ≤ δ for all t ≥ tδ. Thus, f(wt) − f∗, f(yt) − f∗ ≤ δ for all t ≥ tδ

work page

[37] [41]

(14) satisfies ηt ≥ 2/L for all t ∈ N

Since f is globally L-smooth, the step-size selected by backtracking line-search on eq. (14) satisfies ηt ≥ 2/L for all t ∈ N. As a result, the sequenceht is controlled as, ht ≤ L µ 1 − r µ 2L t f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 . We conclude that an upper-bound on the iteration complexity forwt, yt to enter and remain in the δ + f∗ sub-level set is given ...

work page

[38] [42]

It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set

We have established thatf(yt) − f∗, f(wt) − f∗ ≤ δ for all t ≥ tδ. It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set. In particular, we shall showηt ≥ 1 2L∗ for all t ≥ tδ. proposition 7 implies that for allt ≥ tδ and η ≤ 1/L∗, ˜wt+1(η) = yt − η∇f(yt), satisfies f( ...

work page

[39] [43]

33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21

Combining the two complexities implies we require at most T ≤ s 2L µ log L f(w0) − f∗ + µ 2 ∥w0 − w∗∥2 µ · δ ! + s 2L∗ µ log µ · δ L · ϵ , iterations to compute anϵ sub-optimal point. 33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21. We rely on the following result regarding the performance of the PRP ...

work page 2024

[40] [44]

Our region of interest under glocal smoothness and is{w | f(w) − f∗ ≤ δ}

work page

[41] [45]

From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

Let tδ be any iteration wheref(wtδ) − f∗ ≤ δ. From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

work page

[42] [46]

From Lemma 35, PRP NLCG initialized atw0 is guaranteed to satisfy f(wt) − f∗ ≤ δ after tδ = 1 2 L µ 2 log f(w0)−f∗ δ iterations. 34 Within the local region, there are 3 different ways we could reach an accuracy ofϵ: (a) we might reach an accuracy of ϵ before the first reset occurs due to the global rate from Lemma 35, (b) we might reach an accuracy of ϵ a...

work page

[43] [47]

Initialized at wtδ, from Lemma 35 we require an additionalt = 1 2 L µ 2 log f(wtδ )−f∗ ϵ iterations to have f(wt) − f∗ ≤ ϵ

work page

[44] [48]

We now turn to case (b):

Using that f(wtδ) ≤ δ, it is sufficient to havet = 1 2 L µ 2 log δ ϵ iterations. We now turn to case (b):

work page

[45] [49]

Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG

In order to exploit the local quadratic property, we require at most(d −1) additional iterations beyond tδ. Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG. Using Lemma 36 with¯L = L∗, PRP NLCG is guaranteed to satisfy f(wt) − f∗ ≤ ϵ after another t = 1 2 q L∗ µ log 4 L...

work page

[46] [50]

Using that f(wtδ) − f∗ ≤ δ and Lemma 35, we have f(wtδ+(d−1)) − f∗ ≤ 1−(µ/L)2 1+(µ/L)2 2(d−1) δ ≤ exp(−2(d−1)µ2/L2)δ. , wecanconcludethatitissufficienttohave t = 1 2 q L∗ µ log 4 L∗ µ 2 δ ϵ − (d − 1) q L∗ µ µ2 L2 additional iterations (the negative term reduces the iteration count based on the progress made before the reset, which we have omitted in the f...

work page 2015