Partially Lazy Gradient Descent for Smoothed Online Learning

George Iosifidis; Naram Mhaisen

arxiv: 2601.15984 · v3 · submitted 2026-01-22 · 💻 cs.LG

Partially Lazy Gradient Descent for Smoothed Online Learning

Naram Mhaisen , George Iosifidis This is my paper

Pith reviewed 2026-05-16 11:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords online convex optimizationdynamic regretgradient descentlazy updatessmoothed online learningfollow the regularized leaderpath length

0 comments

The pith

k-lazyGD achieves optimal dynamic regret O(sqrt((P_T+1)T)) for any laziness slack up to Theta(sqrt(T/P_T)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces k-lazyGD to create a tunable spectrum between fully reactive gradient descent and fully lazy updates in smoothed online convex optimization. It establishes that this partial laziness preserves the optimal dynamic regret bound whenever the slack parameter stays below a threshold set by the comparator's total path length. This matters because it shows that learners can favor smoother, more stable updates when the environment permits without losing the ability to track shifting comparators. The result is derived in the follow-the-regularized-leader framework and paired with a matching lower bound. An ensemble of instances with different slacks produces a practical method that stays stable when possible and agile when required.

Core claim

k-lazyGD achieves the optimal dynamic regret O(sqrt((P_T+1)T)) for any laziness slack k up to Theta(sqrt(T/P_T)), where P_T is the comparator path length. This formally connects the allowable laziness to the comparator's shifts, showing that k-lazyGD can retain the inherently small movements of lazy methods without compromising tracking ability.

What carries the argument

The laziness slack k in k-lazyGD, which delays gradient updates for k steps before applying them, analyzed through the follow-the-regularized-leader framework in the setting of hitting and movement costs.

Load-bearing premise

The environment changes are fully captured by the comparator path length P_T in the standard smoothed online convex optimization model with well-defined hitting and movement costs.

What would settle it

A concrete SOCO instance where dynamic regret for some k larger than Theta(sqrt(T/P_T)) grows strictly faster than O(sqrt((P_T+1)T)) would disprove the central claim.

Figures

Figures reproduced from arXiv: 2601.15984 by George Iosifidis, Naram Mhaisen.

**Figure 1.** Figure 1: Switching in example (i, top), showing staleness, and example (ii,bottom), showing stability. Left: switching cost. Right: Snapshots over 4 (top) and 2 (bottom) rounds: greedy updates move continuously, whereas lazy updates remain still or move minimally. correlation, as in this example, the movement of the lazy variant is greatly reduced. As in Example (i), chasing the most recent gradient yields no impr… view at source ↗

**Figure 3.** Figure 3: Geometric intuition for the effect of lazy vs. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Switching and hitting cost in Example (i). Example (ii). We consider a 2-dimensional test with horizon T = 101. The cost sequence g1, . . . , gT ∈ R 2 is axis-aligned: gt = (−1, 0) for t = 1, . . . , 11, and for t = 12, . . . , 101 we set gt = (−1,(−1)t ), i.e. the second coordinate alternates in sign producing (−1, 1),(−1, −1), . . .. This yields a piecewise-constant initial segment followed by rapid vert… view at source ↗

**Figure 5.** Figure 5: Switching and hitting cost in Example (ii). we expect the switching cost to be empirically comparable to (LazyGD) and the hitting cost close to (GD), leading to overall performance that improves upon the standard SOCO baseline (GD). For each algorithm, we report three quantities: switching cost: X T t=1 ∥xt − xt−1∥, dynamic regret: X T t=1 [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Example cost sequence. 1 coordinate, 2 phases of length 20. Few samples of each phase and their [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Shifting stochastic sequences. (a) Switching cost, (b) hitting cost, and (c) total regret for GD, LazyGD, and k-LazyGD with different k. D.2 Corrupted Sequences We next consider a deterministic counterpart to the stochastic phases above. Gradients gt are normalized to satisfy ∥gt∥2 = 1, so G .= 1. We consider linear costs ft(x) .= ⟨gt, x⟩, gt ∈ [−1, 1]5 , ∥gt∥2 = 1, with decisions constrained to the ℓ1 uni… view at source ↗

**Figure 8.** Figure 8: Example cost sequence. 1 coordinate, 2 phases of length 20. Red ‘x’ is the corruption burst. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Corrupted phases. (a) Switching cost, (b) dynamic (hitting) regret, and (c) total regret for GD, LazyGD, and k-LazyGD with different k. D.3 Takeaway from Experiments The simulations highlight three key observations: (i) k-lazyGD can achieve substantial reductions in switching cost across different choices of k (see parts (a)). (ii) In many cases, these savings come with little deterioration in hitting cost… view at source ↗

**Figure 10.** Figure 10: Worst-case for k-lazyGD methods: every switch is the beginning of a new phase: no outliers due to randomness or deterministically. (a) Switching cost, (b) dynamic (hitting) regret, and (c) total regret for GD, LazyGD, and k-LazyGD with different k. Note that the existence of such sequences does not undermine the findings: the ensemble framework can already include the greedy update as the special case k =… view at source ↗

**Figure 11.** Figure 11: Geometric illustration for Prop. 3. Remarks. For the fully lazy case (k = T), the result shows that switching can be bounded as Θ [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

read the original abstract

We introduce \textsc{$k$-lazyGD}, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k{=}1$) and lazy GD/dual-averaging (for $k{=}T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that \textsc{$k$-lazyGD} achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T{+}1)T})$ for any laziness slack $k$ up to $\Theta(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that \textsc{$k$-lazyGD} can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

k-lazyGD lets you tune laziness in SOCO while keeping optimal dynamic regret up to k ~ sqrt(T/P_T), but the ensemble for unknown P_T probably adds a log factor that weakens the exact optimality claim.

read the letter

The paper introduces k-lazyGD, which sits between plain OGD (k=1) and fully lazy dual averaging (k=T). It proves that you can keep updates lazy by a factor k without hurting the optimal dynamic regret O(sqrt((P_T+1)T)) in smoothed online convex optimization, as long as k stays at most Theta(sqrt(T/P_T)). The bound links the allowed laziness directly to the comparator path length, which is the cleanest part of the work.

Referee Report

3 major / 2 minor

Summary. The paper introduces k-lazyGD, a partially lazy variant of online gradient descent in the smoothed online convex optimization (SOCO) setting that incurs both hitting and movement costs. It claims that for any fixed laziness slack k up to Θ(√(T/P_T)), the algorithm attains the optimal dynamic regret O(√((P_T+1)T)) via an FTRL analysis, provides a matching lower bound, and removes dependence on the unknown path length P_T by running an ensemble over multiple k values.

Significance. If the central claims hold, the work establishes a continuous spectrum between fully greedy (k=1) and fully lazy (k=T) methods while preserving optimal dynamic regret in SOCO. This could inform the design of stable yet adaptive online algorithms when movement costs matter. The FTRL-based derivation and matching lower bound are positive features, but the ensemble step for unknown P_T must be shown to avoid extra logarithmic factors for the optimality claim to be fully realized.

major comments (3)

[Abstract and §4] Abstract and §4 (ensemble construction): the headline claim that the ensemble over slacks yields a method with regret O(√((P_T+1)T)) without extra factors is not supported by an explicit regret analysis. Standard doubling or expert-advice aggregation over Θ(log T) candidate k values typically multiplies the base regret by O(log T) or O(log log T); if this occurs here, the exact optimality statement fails when P_T is unknown.
[Theorem 1] Theorem 1 (or equivalent main result): the bound O(√((P_T+1)T)) is proved only for k chosen as a function of the (unknown) P_T. The transition from this oracle-dependent result to the ensemble meta-algorithm is not shown to preserve the same leading term; a concrete regret decomposition for the meta-algorithm is required.
[§3] §3 (FTRL analysis): the derivation of the movement-cost term under partial laziness appears to rely on the standard FTRL potential; it is unclear whether the laziness slack k enters the analysis in a way that genuinely decouples from P_T or whether the bound reduces tautologically once k is set to Θ(√(T/P_T)).

minor comments (2)

[§2] Notation for the hitting and movement costs should be introduced with explicit definitions before the regret statement; the current presentation assumes familiarity with the SOCO model without a self-contained recap.
[Lower bound section] The lower-bound construction is referenced but its dependence on the laziness parameter k is not contrasted with the upper bound; a short table comparing the two would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the careful review and constructive feedback. We address each major comment below. We agree that the ensemble analysis requires expansion and will add explicit regret decompositions and clarifications in the revision.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (ensemble construction): the headline claim that the ensemble over slacks yields a method with regret O(√((P_T+1)T)) without extra factors is not supported by an explicit regret analysis. Standard doubling or expert-advice aggregation over Θ(log T) candidate k values typically multiplies the base regret by O(log T) or O(log log T); if this occurs here, the exact optimality statement fails when P_T is unknown.

Authors: We thank the referee for this observation. The ensemble in §4 runs O(log T) instances over a dyadic grid of candidate k values (corresponding to possible P_T scales). The current manuscript sketches the meta-algorithm but does not provide a full regret decomposition. A standard expert-advice or doubling analysis would indeed introduce an extra O(√(log T)) or O(log T) factor on top of the base O(√((P_T+1)T)) term. We will add the missing decomposition in the revision and, if the factor cannot be removed, revise the abstract and main claims to reflect the precise bound. This is a substantive gap in the present write-up. revision: partial
Referee: [Theorem 1] Theorem 1 (or equivalent main result): the bound O(√((P_T+1)T)) is proved only for k chosen as a function of the (unknown) P_T. The transition from this oracle-dependent result to the ensemble meta-algorithm is not shown to preserve the same leading term; a concrete regret decomposition for the meta-algorithm is required.

Authors: Theorem 1 establishes the bound for a fixed k = Θ(√(T/P_T)). Section 4 then describes the ensemble over a geometric grid of k values. We will insert a new lemma in §4 that decomposes the meta-regret into (i) the regret of the best expert in the grid and (ii) the overhead incurred by the aggregation rule. The revised manuscript will state the resulting bound explicitly, including any additional factors. revision: yes
Referee: [§3] §3 (FTRL analysis): the derivation of the movement-cost term under partial laziness appears to rely on the standard FTRL potential; it is unclear whether the laziness slack k enters the analysis in a way that genuinely decouples from P_T or whether the bound reduces tautologically once k is set to Θ(√(T/P_T)).

Authors: In the FTRL analysis of §3 the algorithm performs a regularized update only every k steps, accumulating the gradient over the slack interval. The movement cost is therefore incurred only at update epochs, and the potential is applied to the summed gradient. The condition k ≤ Θ(√(T/P_T)) ensures that the comparator variation over each slack interval remains controlled relative to the total path length P_T; for larger k the tracking error grows. The bound is not tautological—the range on k is derived from balancing the hitting-cost and movement-cost terms. We will expand the proof with an explicit intermediate inequality showing the dependence on k. revision: yes

Circularity Check

0 steps flagged

Standard FTRL analysis with no reduction to fitted inputs or self-citation load-bearing

full rationale

The paper derives the O(sqrt((P_T+1)T)) dynamic regret bound for k-lazyGD by direct analysis within the standard Follow the Regularized Leader (FTRL) framework, as stated in the abstract. This is an external, well-established template in online convex optimization and does not reduce the claimed result to a quantity fitted from the target data by construction. The ensemble over slacks is invoked to handle unknown P_T, but the provided text gives no equations showing the meta-regret collapsing to the base bound or to a self-referential fit. No self-citation chains, ansatz smuggling, or renaming of known empirical patterns appear in the load-bearing steps. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of smoothed online convex optimization and the FTRL framework; no new entities are introduced.

free parameters (1)

k
Laziness slack parameter chosen up to Theta(sqrt(T/P_T)) based on comparator path length

axioms (1)

domain assumption Standard assumptions of smoothed online convex optimization including hitting and movement costs
Invoked throughout the regret analysis in the abstract

pith-pipeline@v0.9.0 · 5525 in / 1216 out tokens · 72277 ms · 2026-05-16T11:54:55.317948+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,

Tuo Liu, El Mehdi Saad, Wojciech Kot lowski, and Francesco Orabona. Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,

work page arXiv
[2]

On sequential strategies for loss functions with memory.IEEE Transactions on Information Theory, 48(7):1947–1958,

Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory.IEEE Transactions on Information Theory, 48(7):1947–1958,

work page 1947
[3]

A Modern Introduction to Online Learning

Francesco Orabona. A Modern Introduction to Online Learning.arXiv.1912.13213,

work page internal anchor Pith review Pith/arXiv arXiv 1912
[4]

Output:{x t}T t=1

Input: compact setX, regularizationσ >0, laziness slackk∈N, horizonT. Output:{x t}T t=1. 1:Initializex 1 ∈ X, setp 1:0 .=0. 2:fort= 1,2, . . . , Tdo 3:Use actionx t. 4:f t(·)is revealed 5:Compute a subgradientg t ∈∂f t(xt). 6:Set the countern t .= (t−1) mod (k) 7:Compute the unconstrained FTRL centery t =− 1 σ p1:t−1. 8:Set the normal-cone correctiong I t...

work page 2017
[5]

Gaussian gradients with variance 10 per coordinate

We generatePphases with alternating mean directions µp ∈ {+1,−1}(starting positive), and sample within each phase i.i.d. Gaussian gradients with variance 10 per coordinate. Each sampled row is thenℓ 2-normalized (i.e., projected Gaussian). Concatenating thePphases yields a horizon of lengthT tot .=P T phase. This construction is interesting because it is ...

work page 2022
[6]

We consider linear costs ft(x) .=⟨g t,x⟩,g t ∈[−1,1] 5,∥g t∥2 = 1, with decisions constrained to theℓ 1 unit ball X .={x∈R 5 :∥x∥ 1 ≤1}. Sequence generation.In each phase, gradients are predominantly aligned with one sign for long stretches (here of length 100), but are interspersed with short bursts of the opposite sign (length 10). The phase length is 2...

work page 2000
[7]

Moreover, the losses within an active interval differ slightly for the learner and the comparator

Moreover, since the sequence starts with the positive half, the sum over any window is non-negative: jX t=1 gt ≥0,∀j∈[T]. Moreover, the losses within an active interval differ slightly for the learner and the comparator. Specifically, the learner incurs− k−1 2 , while the comparator incurs−(k−1), since the final step contributes zero cost. Hence, the regr...

work page 2012
[8]

blocking argument

Moreover, by properties of Euclidean projection, the switching cost remains bounded in theℓ 2 norm, which in turn controls all other norms by equivalence of norms. F.2 Relation to other Phased Updates The phased structure of the update rule ink-lazyGDis reminiscent of blocking methods, which also operate in phases (Merhav et al., 2002; Chen et al., 2020)....

work page 2002
[9]

The inequality holds because of the update rule for eachx t+1 for anyt

! = TX t=1 h0:t(xt)− TX t=1 h0:t(xt+1)− T−1X t=1 h0:t(ut)− T−1X t=1 h0:t(ut+1 ! +r(u 1). The inequality holds because of the update rule for eachx t+1 for anyt. I.e., h0:T (xT+1 )≤h 0:T (uT ). Also,h 0(u1) =r(u 1). Writingh t of the LHS explicitly, and adding the total switching cost to both sides, we get TX t=1 ⟨pt,x t −u t⟩+ T−1X t=1 ∥xt −x t−1∥ ≤ TX t=...

work page 2017
[10]

In both cases,x t satisfies the optimality condition for minimizingh 0:t−1(x) +⟨g I t ,x⟩overX, and is therefore a valid minimizer

and hence (24) becomes 0∈ N X (xt), which always holds. In both cases,x t satisfies the optimality condition for minimizingh 0:t−1(x) +⟨g I t ,x⟩overX, and is therefore a valid minimizer. G.2 Proof of Lemma7 Lemma 7Under the Lipschitzness of the loss functions and compactness of the domain,k-lazyGDiterates guarantee the following bounds on each part of th...

work page 2007
[11]

Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T

For any sequence of convex losses and comparators, the dynamic regret with switching cost of k-lazyGDsatisfies RT ≤ TX t=1 min Gδt, G2 2σ + (2Rσ+kG)P T + σR2 2 + TX t=1 min(δt, G σ ), whereδ t .=∥x t+1 −xt∥. Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T . Proof.Starting from the result of Lemma 6, and su...

work page 2021
[12]

Proof.(ofTheorem

For any comparator sequence, running the meta-learner of (Zhang et al., 2021, “SAder”) over a set ofk-lazyGDexperts fromH, guarantees RT =O p (PT + 1)T . Proof.(ofTheorem

work page 2021
[13]

unlike those results, this bound requires no additional assumptions on the domain or on the magni- tude/direction of the accumulated gradients

2 and hence ln(1/wσs 1 )≤2 ln(s+ 1)≤2 ln(log( p 1 + 6PT/R+ 1)) I Additional Details on Laziness Advantages I.1 Variance-based Switching Cost Beyond the stability properties formalized in Propositions 3 and 4, laziness also yields a variance-based switching bound. unlike those results, this bound requires no additional assumptions on the domain or on the m...

work page 2010
[14]

In thek-lazy case, the same principle applies phase by phase, with the bound expressed relative to the average within each lazy block

Remarks.For the fully lazy case (k=T), the result shows that switching can be bounded as Θ ∥gt − 1 t g1:t∥+ 1 t , namely in terms of the deviation of the most recent gradient from the running mean, rather than its raw magnitude. In thek-lazy case, the same principle applies phase by phase, with the bound expressed relative to the average within each lazy ...

work page 2010

[1] [1]

Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,

Tuo Liu, El Mehdi Saad, Wojciech Kot lowski, and Francesco Orabona. Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,

work page arXiv

[2] [2]

On sequential strategies for loss functions with memory.IEEE Transactions on Information Theory, 48(7):1947–1958,

Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory.IEEE Transactions on Information Theory, 48(7):1947–1958,

work page 1947

[3] [3]

A Modern Introduction to Online Learning

Francesco Orabona. A Modern Introduction to Online Learning.arXiv.1912.13213,

work page internal anchor Pith review Pith/arXiv arXiv 1912

[4] [4]

Output:{x t}T t=1

Input: compact setX, regularizationσ >0, laziness slackk∈N, horizonT. Output:{x t}T t=1. 1:Initializex 1 ∈ X, setp 1:0 .=0. 2:fort= 1,2, . . . , Tdo 3:Use actionx t. 4:f t(·)is revealed 5:Compute a subgradientg t ∈∂f t(xt). 6:Set the countern t .= (t−1) mod (k) 7:Compute the unconstrained FTRL centery t =− 1 σ p1:t−1. 8:Set the normal-cone correctiong I t...

work page 2017

[5] [5]

Gaussian gradients with variance 10 per coordinate

We generatePphases with alternating mean directions µp ∈ {+1,−1}(starting positive), and sample within each phase i.i.d. Gaussian gradients with variance 10 per coordinate. Each sampled row is thenℓ 2-normalized (i.e., projected Gaussian). Concatenating thePphases yields a horizon of lengthT tot .=P T phase. This construction is interesting because it is ...

work page 2022

[6] [6]

We consider linear costs ft(x) .=⟨g t,x⟩,g t ∈[−1,1] 5,∥g t∥2 = 1, with decisions constrained to theℓ 1 unit ball X .={x∈R 5 :∥x∥ 1 ≤1}. Sequence generation.In each phase, gradients are predominantly aligned with one sign for long stretches (here of length 100), but are interspersed with short bursts of the opposite sign (length 10). The phase length is 2...

work page 2000

[7] [7]

Moreover, the losses within an active interval differ slightly for the learner and the comparator

Moreover, since the sequence starts with the positive half, the sum over any window is non-negative: jX t=1 gt ≥0,∀j∈[T]. Moreover, the losses within an active interval differ slightly for the learner and the comparator. Specifically, the learner incurs− k−1 2 , while the comparator incurs−(k−1), since the final step contributes zero cost. Hence, the regr...

work page 2012

[8] [8]

blocking argument

Moreover, by properties of Euclidean projection, the switching cost remains bounded in theℓ 2 norm, which in turn controls all other norms by equivalence of norms. F.2 Relation to other Phased Updates The phased structure of the update rule ink-lazyGDis reminiscent of blocking methods, which also operate in phases (Merhav et al., 2002; Chen et al., 2020)....

work page 2002

[9] [9]

The inequality holds because of the update rule for eachx t+1 for anyt

! = TX t=1 h0:t(xt)− TX t=1 h0:t(xt+1)− T−1X t=1 h0:t(ut)− T−1X t=1 h0:t(ut+1 ! +r(u 1). The inequality holds because of the update rule for eachx t+1 for anyt. I.e., h0:T (xT+1 )≤h 0:T (uT ). Also,h 0(u1) =r(u 1). Writingh t of the LHS explicitly, and adding the total switching cost to both sides, we get TX t=1 ⟨pt,x t −u t⟩+ T−1X t=1 ∥xt −x t−1∥ ≤ TX t=...

work page 2017

[10] [10]

In both cases,x t satisfies the optimality condition for minimizingh 0:t−1(x) +⟨g I t ,x⟩overX, and is therefore a valid minimizer

and hence (24) becomes 0∈ N X (xt), which always holds. In both cases,x t satisfies the optimality condition for minimizingh 0:t−1(x) +⟨g I t ,x⟩overX, and is therefore a valid minimizer. G.2 Proof of Lemma7 Lemma 7Under the Lipschitzness of the loss functions and compactness of the domain,k-lazyGDiterates guarantee the following bounds on each part of th...

work page 2007

[11] [11]

Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T

For any sequence of convex losses and comparators, the dynamic regret with switching cost of k-lazyGDsatisfies RT ≤ TX t=1 min Gδt, G2 2σ + (2Rσ+kG)P T + σR2 2 + TX t=1 min(δt, G σ ), whereδ t .=∥x t+1 −xt∥. Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T . Proof.Starting from the result of Lemma 6, and su...

work page 2021

[12] [12]

Proof.(ofTheorem

For any comparator sequence, running the meta-learner of (Zhang et al., 2021, “SAder”) over a set ofk-lazyGDexperts fromH, guarantees RT =O p (PT + 1)T . Proof.(ofTheorem

work page 2021

[13] [13]

unlike those results, this bound requires no additional assumptions on the domain or on the magni- tude/direction of the accumulated gradients

2 and hence ln(1/wσs 1 )≤2 ln(s+ 1)≤2 ln(log( p 1 + 6PT/R+ 1)) I Additional Details on Laziness Advantages I.1 Variance-based Switching Cost Beyond the stability properties formalized in Propositions 3 and 4, laziness also yields a variance-based switching bound. unlike those results, this bound requires no additional assumptions on the domain or on the m...

work page 2010

[14] [14]

In thek-lazy case, the same principle applies phase by phase, with the bound expressed relative to the average within each lazy block

Remarks.For the fully lazy case (k=T), the result shows that switching can be bounded as Θ ∥gt − 1 t g1:t∥+ 1 t , namely in terms of the deviation of the most recent gradient from the running mean, rather than its raw magnitude. In thek-lazy case, the same principle applies phase by phase, with the bound expressed relative to the average within each lazy ...

work page 2010