pith. sign in

arxiv: 2506.12648 · v2 · pith:R7GKDYAYnew · submitted 2025-06-14 · 🧮 math.OC · cs.LG

Glocal Smoothness: Line search and adaptive step sizes can help in theory too!

Pith reviewed 2026-05-22 00:51 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords glocal smoothnessline searchadaptive step sizeiteration complexitygradient descentfirst-order methodssmooth optimization
0
0 comments X

The pith

Glocal smoothness lets line-search gradient descent achieve better iteration complexity than fixed-step accelerated methods in some settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces glocal smoothness, a property of the objective function that captures both its global and local smoothness behavior using only function values rather than any algorithm iterates. This iterate-independent assumption makes it possible to prove concrete upper bounds on the number of iterations needed by different first-order methods. Under glocal smoothness, simple line searches are shown to improve rates over any fixed step size, and in certain cases plain gradient descent with a line search beats accelerated gradient descent that uses a fixed step size. The same property yields improved guarantees for Polyak steps, AdGD, stochastic gradients, coordinate descent, and nonlinear conjugate gradient methods.

Core claim

Under the glocal smoothness assumption, which depends solely on properties of the objective function, line searches and adaptive step sizes yield upper bounds on iteration complexities in terms of iterate-independent constants. This framework demonstrates advantages of line searches over fixed step sizes and shows that gradient descent with line search can have better iteration complexity than accelerated methods with fixed step sizes in certain settings. Similar improvements hold for Polyak and AdGD step sizes as well as other first-order methods.

What carries the argument

The glocal smoothness property, a function-only characterization of global and local gradient Lipschitz constants that enables complexity bounds independent of algorithm iterates.

If this is right

  • Line searches produce better iteration complexities than fixed step sizes for gradient descent.
  • In some settings gradient descent with line search has better iteration complexity than accelerated methods with fixed step sizes.
  • Polyak and AdGD adaptive steps also obtain improved iteration bounds.
  • Coordinate optimization, stochastic gradient methods, accelerated gradient methods, and nonlinear conjugate gradient methods all inherit better rates under the same assumption.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The assumption could let practitioners estimate local smoothness constants directly from function evaluations without tracking iterates.
  • Similar function-only smoothness notions might simplify analysis of other adaptive schemes such as trust-region or quasi-Newton methods.
  • The results suggest that many existing local-Lipschitz proofs could be restated in iterate-independent terms for easier comparison.
  • Numerical checks on standard test functions could verify whether the predicted ordering of iteration counts appears in practice.

Load-bearing premise

The objective function satisfies the glocal smoothness property defined solely in terms of function properties rather than algorithm iterates.

What would settle it

A concrete quadratic or piecewise-quadratic function that satisfies glocal smoothness, together with explicit iteration counts showing whether gradient descent plus line search uses fewer steps than Nesterov acceleration with its theoretically optimal fixed step size.

Figures

Figures reproduced from arXiv: 2506.12648 by Aaron Mishkin, Curtis Fox, Mark Schmidt, Sharan Vaswani.

Figure 1
Figure 1. Figure 1: Logistic regression on 4 datasets showing the optimality gap throughout training for gradient [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of proposition 7 along the slice [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a 'glocal smoothness' assumption on the objective function that encodes both global and local gradient Lipschitz behavior using only function properties, independent of any algorithm iterates. This assumption is used to derive iterate-independent iteration complexity upper bounds for first-order methods. The central claims are that line-search gradient descent enjoys advantages over fixed-step gradient descent, and that in some regimes its complexity is better than that of accelerated gradient descent with fixed steps; the framework is also applied to Polyak and AdGD step sizes as well as coordinate, stochastic, accelerated, and conjugate-gradient methods.

Significance. If the glocal smoothness property yields the stated iterate-independent bounds and the complexity comparisons are valid, the work supplies a useful theoretical device for justifying adaptive and line-search methods under a single function-only assumption. The ability to compare complexities directly across algorithms without iterate dependence is a clear strength and could help reconcile practical performance with worst-case theory.

major comments (2)
  1. [§4] §4 (main complexity comparison): the claim that gradient descent with line search has better iteration complexity than accelerated gradient descent with fixed steps in 'some settings' requires the explicit big-O expressions in terms of the glocal parameters (global and local Lipschitz constants) to be stated side-by-side so that the regime where the inequality holds can be verified.
  2. [Definition of glocal smoothness] Definition of glocal smoothness (likely §2): confirm that the definition is stated purely in terms of function values or gradient norms without any reference to a sequence of iterates, because the iterate-independence of the resulting bounds is load-bearing for all subsequent comparisons.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'in some settings' should be replaced by a brief, quantitative description of the regime (e.g., when the local Lipschitz constant is sufficiently smaller than the global one).
  2. Consider adding a short numerical example or low-dimensional function that satisfies glocal smoothness with a large gap between local and global constants; this would make the practical relevance of the assumption concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and recommendation for minor revision. The comments help clarify the presentation of the glocal smoothness framework and its complexity comparisons. We address each major comment below and have incorporated revisions to make the key claims more explicit and verifiable.

read point-by-point responses
  1. Referee: [§4] §4 (main complexity comparison): the claim that gradient descent with line search has better iteration complexity than accelerated gradient descent with fixed steps in 'some settings' requires the explicit big-O expressions in terms of the glocal parameters (global and local Lipschitz constants) to be stated side-by-side so that the regime where the inequality holds can be verified.

    Authors: We agree that an explicit side-by-side comparison strengthens the claim. In the original manuscript, the iteration complexities are derived separately under the glocal smoothness assumption: GD with line search achieves a bound depending on the local Lipschitz parameter (often leading to faster rates when local smoothness is favorable), while fixed-step AGD depends on the global Lipschitz constant. To address this, we have added a new paragraph and table in the revised §4 that states the big-O expressions side-by-side in terms of the global constant L and local constant l (or equivalent glocal parameters), explicitly identifying the regimes (e.g., when l << L) where GD+line search has superior iteration complexity. This makes the 'some settings' claim directly verifiable without altering the underlying proofs. revision: yes

  2. Referee: [Definition of glocal smoothness] Definition of glocal smoothness (likely §2): confirm that the definition is stated purely in terms of function values or gradient norms without any reference to a sequence of iterates, because the iterate-independence of the resulting bounds is load-bearing for all subsequent comparisons.

    Authors: We confirm that the glocal smoothness definition in §2 is formulated exclusively in terms of the objective function f and its gradient properties (specifically, a global Lipschitz constant L together with a local smoothness parameter that depends only on function values or gradient norms at arbitrary points x). The definition contains no reference to any algorithm-generated sequence of iterates, which directly enables the iterate-independent complexity bounds used throughout the paper for all comparisons. To further emphasize this point for readers, we have added a short clarifying remark immediately after the definition. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from new assumption

full rationale

The paper defines glocal smoothness as a function-only property independent of iterates, then derives iteration complexity bounds and comparisons (e.g., line-search GD vs. fixed-step accelerated methods) directly from this assumption. No quoted steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the new assumption supplies independent content for the bounds rather than the bounds defining or presupposing the assumption. The derivation chain remains non-circular and externally falsifiable via the stated function properties.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central results rest on the newly introduced glocal smoothness assumption; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption The objective function satisfies glocal smoothness, a characterization of global and local smoothness that depends only on properties of the function.
    This assumption is invoked to obtain iterate-independent iteration complexity bounds that enable comparisons across algorithms.

pith-pipeline@v0.9.0 · 5761 in / 1158 out tokens · 64189 ms · 2026-05-22T00:51:56.635389+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    Multinomial logistic regression algorithm.Ann

    (cited on 6) Dankmar Böhning. Multinomial logistic regression algorithm.Ann. Inst. Stat. Math., 44(1):197–200, 1992. (cited on 4) Jérôme Bolte and Edouard Pauwels. Curiosities and counterexamples in smooth convex optimization.Math. Program., 195(1):553–603, 2022. (cited on 9) Stephen P. Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge, 2004. (c...

  2. [2]

    Curtis and Daniel P

    (cited on 15) Frank E. Curtis and Daniel P. Robinson. Regional complexity analysis of algorithms for nonconvex smooth optimization. Math. Program., 187(1):579–615, 2021. (cited on 4) Shuvomoy Das Gupta, Robert M Freund, Xu Andy Sun, and Adrien Taylor. Nonlinear conjugate gradient methods: worst-case convergence rates via computer-assisted analyses.Mathema...

  3. [3]

    Nesterov

    (cited on 14) Yurii E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM J. Optim., 22(2):341–362, 2012. (cited on 10, 15, 28) Yurii E. Nesterov. Gradient methods for minimizing composite functions.Math. Program., 140(1):125–161, 2013b. (cited on 3, 14) Yurii E. Nesterov and Boris T. Polyak. Cubic regularization o...

  4. [4]

    If f is G-Lipschitz, then f is (L0 + L1G, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth

  5. [5]

    If f is L-smooth, then f is (L, L0 + L1(4L1δ + 2 √L0δ), δ)-glocal smooth. Proof. Recall that a function isG-Lipschitz if∥∇f(x)∥ ≤ G for allx. If f also satisfies(L0, L1) non-uniform smoothness, then, ∥∇f(x) − ∇f(y)∥ ≤ (L0 + L1 G) ∥y − x∥ Hence, f is globally(L0 + L1G)-smooth. If insteadf is L-smooth, then by definition it is globallyL-smooth. We now proce...

  6. [7]

    Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ

    The Polyak step does not guarantee decrease in the function value, but we remain in the region of interest once∥wt −w∗∥ becomes sufficiently small. Lettδ be the first iteration whereL 2 ∥wtδ −w∗∥2 ≤ δ. By global smoothness we have f(w) − f∗ ≤ L 2 ∥w − w∗∥2, and thus, f(wtδ) − f∗ ≤ δ. Further by the monotonicity of∥wt − w∗∥, we havef(wt) − f∗ ≤ δ for all t ≥ tδ

  7. [8]

    Initialized at w0, using lemma 25 with¯L = L, GD(Polyak) is guaranteed to haveL 2 ∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log L 2 ∥w0−w∗∥2 δ ⌉ iterations

  8. [9]

    Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

    By L∗ smoothness within the region of interest we havef(wt) − f∗ ≤ ϵ if we have L∗ 2 ∥wt − w∗∥2 ≤ ϵ for t ≥ tδ. Initialized at wtδ, using lemma 25 with¯L = L∗, we have thatL∗ 2 ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L∗ µ log L∗ 2 ∥wtδ −w∗∥2 ϵ ⌉ iterations

  9. [10]

    Since L 2 ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log L∗ L δ ϵ ⌉ additional iterations. D.2 Example of Polyak Step Size Increasing Function Value In this section, we give an example showing that gradient descent with the Polyak step size can increase the function value while decreasing the2-norm distance to the minimizer. Consider the functio...

  10. [12]

    It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

    Let tδ be the first iteration such that∥wtδ − w∗∥2 ≤ δ. It follows from the monotonicity of∥wt − w∗∥, that all iterateswt for t ≥ tδ stay in the region of interest

  11. [13]

    Initialized atw0, using Lemma 25 with¯L = L, GD(Polyak) is guaranteed to have∥wt − w∗∥2 ≤ δ after tδ = ⌈ 4L µ log ∥w0−w∗∥2 δ ⌉ iterations

  12. [14]

    Initialized at wtδ, using Lemma 25 with ¯L = L∗, we have that ∥wt − w∗∥2 ≤ ϵ after an additional t = ⌈ 4L µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

  13. [15]

    E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]

    Since ∥wtδ − w∗∥2 ≤ δ, it is sufficient to havet = ⌈ 4L∗ µ log δ ϵ ⌉ additional iterations. E AdGD Convergence Results In this section we give the proof of Theorem 11, which makes use of an existing convergence result for AdGD [Malitsky and Mishchenko, 2020, Theorem 2]. Lemma 27.Assume thatf is ¯L-smooth andµ-strongly convex. For allt > 2 the iterates of ...

  14. [17]

    Now let tδ = t ′ + 2

    Let t ′ be the first iteration such thatΦt′ ≤ δ and t ′ > 2. Now let tδ = t ′ + 2. Since ∥wt − w∗∥2 ≤ Φt for all t, it follows from the monotonicity ofΦt that all t ≥ tδ − 1 stay in the region of interest

  15. [18]

    Initialized at w0, using Lemma 27 with¯L = L, AdGD is guaranteed to have∥wtδ − w∗∥2 ≤ Φtδ ≤ δ after tδ = O L µ log Φ3 δ iterations

  16. [19]

    Initialized at wtδ, using Lemma 27 with ¯L = L∗, we have that∥wt − w∗∥2 ≤ Φt ≤ ϵ after another t = O L∗ µ log Φtδ ϵ iterations

  17. [20]

    26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions

    Since Φtδ ≤ δ, it is sufficient to havet = O L∗ µ log δ ϵ additional iterations. 26 F Convex Objectives e GD(LO) in convex settings, we require an iteration complexity of GD(LO) for smooth and convex functions. We give such a result next, following an argument similar to previous work [Beck and Tetruashvili, 2013, Section 3]. Lemma 28.Assume thatf is ¯L-s...

  18. [21]

    Initialized at w0, using Lemma 28 with¯L = L, GD(LO) is guaranteed to havef(wt) − f∗ ≤ δ after tδ = O L δ R2(f(w0)) iterations. F.2 Globally Convex and Locally Convex Proof of Theorem 13.We only need to specify Steps 4 and 5 as the other steps are identical to the analysis of GD(LO) in the globally convex and locally strongly-convex case:

  19. [22]

    Initialized atwtδ, using Corollary 8 Lemma 28 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after another t = O L∗ ϵ R2(f(wtδ)) iterations

  20. [23]

    G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17

    Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = O L∗ ϵ R2(δ + f∗) additional iterations. G Coordinate Descent Results In this section we give the proofs of Theorem 16 and Theorem 17. We first review a standard result regarding the convergence rate of random and greedy coordinate descent [Nesterov, 2012, Nutini et al., 2015]. Lemma 29. Assume that ...

  21. [25]

    Let ρ = 1 − µ dL where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (f(wt) − f∗ > λ(f(w0) − f∗)) ≤ E[f(wt)] − f∗ λ(f(w0) − f∗) ≤ ρt λ . (Using the convergence guarantee) Choosing λ = δ f(w0)−f∗ we obtain P (f(wt) − f∗ > δ) ≤ ρt[f(w0) − f∗] δ . ...

  22. [26]

    Initialized at w0, using Lemma 29 with ¯L = L, random coordinate descent satisfiesf(wt) − f∗ ≤ δ after tδ = ⌈ dL µ log f(w0)−f∗ δζ ⌉ iterations with probability1 − ζ

  23. [27]

    Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

    Define Xδ as the event thatf(wtδ) − f∗ ≤ δ. Initialized atwtδ, using Corollary 30 and Lemma 29 with ¯L = L∗, we have thatE[f(wt) | Xδ] − f∗ ≤ ϵ after another t = ⌈ dL∗ µ log f(wtδ )−f∗ ϵ ⌉ iterations

  24. [28]

    Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations

    Conditioned on Xδ, f(wtδ) − f∗ ≤ δ. Hence, it is sufficient to have t = ⌈ dL∗ µ log δ ϵ ⌉ additional iterations. Proof of Theorem 17.We follow the analysis structure outlined in Section 3.6:

  25. [29]

    Our region of interest under coordinate-wise glocal smoothness is{w|f(w) − f∗ ≤ δ}

  26. [30]

    It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

    Let tδ be the first iteration such thatf(wtδ) − f∗ ≤ δ. It follows from monotonicity off(wt) that all subsequent t stay in the region of interest

  27. [31]

    Initialized at w0, using Lemma 29 with¯L = L, coordinate descent with greedy coordinate selection is guaranteed to havef(wt) − f∗ ≤ δ after tδ = ⌈ L µ1 log f(w0)−f∗ δ ⌉ iterations

  28. [32]

    Initialized at wtδ, using Corollary 30 and Lemma 29 with¯L = L∗, we have thatf(wt) − f∗ ≤ ϵ after an additional t = ⌈ L∗ µ1 log f(wtδ )−f∗ ϵ ⌉ iterations

  29. [33]

    Using that f(wtδ) − f∗ ≤ δ, it is sufficient to havet = ⌈ L∗ µ1 log δ ϵ ⌉ additional iterations. 29 H Stochastic Gradient Descent Convergence Results The following result is implied by existing work [Mishkin, 2020, Theorem 9] on using a stochastic line search under the interpolation assumption. Earlier work [Vaswani et al., 2019, Theorem 1] also gives a s...

  30. [34]

    Our region of interest under glocal smoothness of the iterates is{w|∥w − w∗∥2 ≤ δ}

  31. [35]

    Let ρ = 1 − µ 2Lmax where we note that0 < ρ < 1. Following a similar argument to prior work [Konečný and Richtárik, 2017, Theorem 5], using Markov’s inequality we have for anyλ > 0, P (∥wt − w∗∥2 > λ∥w0 − w∗∥2) ≤ E[∥wt − w∗∥2] λ∥w0 − w∗∥2 ≤ ρt λ . Choosing λ = δ ∥w0−w∗∥2 we obtain P (∥wt − w∗∥2 > δ) ≤ ρt∥w0 − w∗∥2 δ . For a probabilityζ such that 0 < ζ < ...

  32. [36]

    Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

    Initialized at w0, using Lemma 31 with¯L = Lmax, stochastic gradient descent satisfies∥wt − w∗∥2 ≤ δ after tδ = l 2Lmax µ log ∥w0−w∗∥2 δζ m iterations with probability1 − ζ. Here we have used thatηmax ≥ 1 Lmax,∗ ≥ 1 Lmax

  33. [37]

    Define Xδ as the event that∥wtδ − w∗∥ ≤ δ. Initialized at wtδ, using Lemma 31 with ¯L = Lmax,∗, and noting that we assume ηmax ≥ 1 Lmax,∗ , we have that E[∥wT − w∗∥ | Xδ] ≤ ϵ after another t = ⌈ 2Lmax,∗ µ log ∥wtδ −w∗∥2 ϵ ⌉ iterations

  34. [38]

    T −1Y t=0 (1 − √ηt · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 , f(zT ) − f(w∗) ≤ ¯L µ

    Conditioned on Xδ, ∥wtδ − w∗∥2 ≤ δ. Hence, it is sufficient to havet = ⌈ 2Lmax,∗ µ log δ ϵ ⌉ additional iterations. 30 I Acceleration Convergence Results Lemma 32.If f is µ-strongly convex, then the maximum step-size satisfying the Armijo condition in eq. (14) is bounded as, sup n η > 0 : f(wt − η∇f(wt)) ≤ f(wt) − η 2 ∥∇f(wt)∥2 o ≤ 1 µ . Proof. Since f is...

  35. [39]

    Our region of interest under glocal smoothness is{w|f(w) − f∗ ≤ δ}

  36. [40]

    Let tδ be the firstt for which ht ≤ δ

    By proposition 34, we know that max {f(wt), f(yt)} ≤ L µ "t−1Y i=0 (1 − √ηi · µ) # f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 =: ht. Let tδ be the firstt for which ht ≤ δ. Since ht is monotone decreasing int, ht ≤ htδ ≤ δ for all t ≥ tδ. Thus, f(wt) − f∗, f(yt) − f∗ ≤ δ for all t ≥ tδ

  37. [41]

    (14) satisfies ηt ≥ 2/L for all t ∈ N

    Since f is globally L-smooth, the step-size selected by backtracking line-search on eq. (14) satisfies ηt ≥ 2/L for all t ∈ N. As a result, the sequenceht is controlled as, ht ≤ L µ 1 − r µ 2L t f(w0) − f(w∗) + µ 2 ∥w0 − w∗∥2 2 . We conclude that an upper-bound on the iteration complexity forwt, yt to enter and remain in the δ + f∗ sub-level set is given ...

  38. [42]

    It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set

    We have established thatf(yt) − f∗, f(wt) − f∗ ≤ δ for all t ≥ tδ. It only remains to show an improved bound on the step-sizes from line-search for allt ≥ tδ due to L∗-smoothness over theδ + f∗ sub-level set. In particular, we shall showηt ≥ 1 2L∗ for all t ≥ tδ. proposition 7 implies that for allt ≥ tδ and η ≤ 1/L∗, ˜wt+1(η) = yt − η∇f(yt), satisfies f( ...

  39. [43]

    33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21

    Combining the two complexities implies we require at most T ≤ s 2L µ log L f(w0) − f∗ + µ 2 ∥w0 − w∗∥2 µ · δ ! + s 2L∗ µ log µ · δ L · ϵ , iterations to compute anϵ sub-optimal point. 33 J Non-Linear Conjugate Gradient Convergence Results In this section we give the proof of Theorem 21. We rely on the following result regarding the performance of the PRP ...

  40. [44]

    Our region of interest under glocal smoothness and is{w | f(w) − f∗ ≤ δ}

  41. [45]

    From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

    Let tδ be any iteration wheref(wtδ) − f∗ ≤ δ. From the monotonicity in Lemma 35, it follows thatwt stays in the region of interest for subsequentt

  42. [46]

    From Lemma 35, PRP NLCG initialized atw0 is guaranteed to satisfy f(wt) − f∗ ≤ δ after tδ = 1 2 L µ 2 log f(w0)−f∗ δ iterations. 34 Within the local region, there are 3 different ways we could reach an accuracy ofϵ: (a) we might reach an accuracy of ϵ before the first reset occurs due to the global rate from Lemma 35, (b) we might reach an accuracy of ϵ a...

  43. [47]

    Initialized at wtδ, from Lemma 35 we require an additionalt = 1 2 L µ 2 log f(wtδ )−f∗ ϵ iterations to have f(wt) − f∗ ≤ ϵ

  44. [48]

    We now turn to case (b):

    Using that f(wtδ) ≤ δ, it is sufficient to havet = 1 2 L µ 2 log δ ϵ iterations. We now turn to case (b):

  45. [49]

    Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG

    In order to exploit the local quadratic property, we require at most(d −1) additional iterations beyond tδ. Thus, aftertδ +( d −1) iterations the algorithm is in in the quadratic region and has already reset so NLCG is equivalent to linear CG. Using Lemma 36 with¯L = L∗, PRP NLCG is guaranteed to satisfy f(wt) − f∗ ≤ ϵ after another t = 1 2 q L∗ µ log 4 L...

  46. [50]

    Using that f(wtδ) − f∗ ≤ δ and Lemma 35, we have f(wtδ+(d−1)) − f∗ ≤ 1−(µ/L)2 1+(µ/L)2 2(d−1) δ ≤ exp(−2(d−1)µ2/L2)δ. , wecanconcludethatitissufficienttohave t = 1 2 q L∗ µ log 4 L∗ µ 2 δ ϵ − (d − 1) q L∗ µ µ2 L2 additional iterations (the negative term reduces the iteration count based on the progress made before the reset, which we have omitted in the f...