Partially Lazy Gradient Descent for Smoothed Online Learning
Pith reviewed 2026-05-16 11:54 UTC · model grok-4.3
The pith
k-lazyGD achieves optimal dynamic regret O(sqrt((P_T+1)T)) for any laziness slack up to Theta(sqrt(T/P_T)).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
k-lazyGD achieves the optimal dynamic regret O(sqrt((P_T+1)T)) for any laziness slack k up to Theta(sqrt(T/P_T)), where P_T is the comparator path length. This formally connects the allowable laziness to the comparator's shifts, showing that k-lazyGD can retain the inherently small movements of lazy methods without compromising tracking ability.
What carries the argument
The laziness slack k in k-lazyGD, which delays gradient updates for k steps before applying them, analyzed through the follow-the-regularized-leader framework in the setting of hitting and movement costs.
Load-bearing premise
The environment changes are fully captured by the comparator path length P_T in the standard smoothed online convex optimization model with well-defined hitting and movement costs.
What would settle it
A concrete SOCO instance where dynamic regret for some k larger than Theta(sqrt(T/P_T)) grows strictly faster than O(sqrt((P_T+1)T)) would disprove the central claim.
Figures
read the original abstract
We introduce \textsc{$k$-lazyGD}, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k{=}1$) and lazy GD/dual-averaging (for $k{=}T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that \textsc{$k$-lazyGD} achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T{+}1)T})$ for any laziness slack $k$ up to $\Theta(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that \textsc{$k$-lazyGD} can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces k-lazyGD, a partially lazy variant of online gradient descent in the smoothed online convex optimization (SOCO) setting that incurs both hitting and movement costs. It claims that for any fixed laziness slack k up to Θ(√(T/P_T)), the algorithm attains the optimal dynamic regret O(√((P_T+1)T)) via an FTRL analysis, provides a matching lower bound, and removes dependence on the unknown path length P_T by running an ensemble over multiple k values.
Significance. If the central claims hold, the work establishes a continuous spectrum between fully greedy (k=1) and fully lazy (k=T) methods while preserving optimal dynamic regret in SOCO. This could inform the design of stable yet adaptive online algorithms when movement costs matter. The FTRL-based derivation and matching lower bound are positive features, but the ensemble step for unknown P_T must be shown to avoid extra logarithmic factors for the optimality claim to be fully realized.
major comments (3)
- [Abstract and §4] Abstract and §4 (ensemble construction): the headline claim that the ensemble over slacks yields a method with regret O(√((P_T+1)T)) without extra factors is not supported by an explicit regret analysis. Standard doubling or expert-advice aggregation over Θ(log T) candidate k values typically multiplies the base regret by O(log T) or O(log log T); if this occurs here, the exact optimality statement fails when P_T is unknown.
- [Theorem 1] Theorem 1 (or equivalent main result): the bound O(√((P_T+1)T)) is proved only for k chosen as a function of the (unknown) P_T. The transition from this oracle-dependent result to the ensemble meta-algorithm is not shown to preserve the same leading term; a concrete regret decomposition for the meta-algorithm is required.
- [§3] §3 (FTRL analysis): the derivation of the movement-cost term under partial laziness appears to rely on the standard FTRL potential; it is unclear whether the laziness slack k enters the analysis in a way that genuinely decouples from P_T or whether the bound reduces tautologically once k is set to Θ(√(T/P_T)).
minor comments (2)
- [§2] Notation for the hitting and movement costs should be introduced with explicit definitions before the regret statement; the current presentation assumes familiarity with the SOCO model without a self-contained recap.
- [Lower bound section] The lower-bound construction is referenced but its dependence on the laziness parameter k is not contrasted with the upper bound; a short table comparing the two would improve clarity.
Simulated Author's Rebuttal
Thank you for the careful review and constructive feedback. We address each major comment below. We agree that the ensemble analysis requires expansion and will add explicit regret decompositions and clarifications in the revision.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (ensemble construction): the headline claim that the ensemble over slacks yields a method with regret O(√((P_T+1)T)) without extra factors is not supported by an explicit regret analysis. Standard doubling or expert-advice aggregation over Θ(log T) candidate k values typically multiplies the base regret by O(log T) or O(log log T); if this occurs here, the exact optimality statement fails when P_T is unknown.
Authors: We thank the referee for this observation. The ensemble in §4 runs O(log T) instances over a dyadic grid of candidate k values (corresponding to possible P_T scales). The current manuscript sketches the meta-algorithm but does not provide a full regret decomposition. A standard expert-advice or doubling analysis would indeed introduce an extra O(√(log T)) or O(log T) factor on top of the base O(√((P_T+1)T)) term. We will add the missing decomposition in the revision and, if the factor cannot be removed, revise the abstract and main claims to reflect the precise bound. This is a substantive gap in the present write-up. revision: partial
-
Referee: [Theorem 1] Theorem 1 (or equivalent main result): the bound O(√((P_T+1)T)) is proved only for k chosen as a function of the (unknown) P_T. The transition from this oracle-dependent result to the ensemble meta-algorithm is not shown to preserve the same leading term; a concrete regret decomposition for the meta-algorithm is required.
Authors: Theorem 1 establishes the bound for a fixed k = Θ(√(T/P_T)). Section 4 then describes the ensemble over a geometric grid of k values. We will insert a new lemma in §4 that decomposes the meta-regret into (i) the regret of the best expert in the grid and (ii) the overhead incurred by the aggregation rule. The revised manuscript will state the resulting bound explicitly, including any additional factors. revision: yes
-
Referee: [§3] §3 (FTRL analysis): the derivation of the movement-cost term under partial laziness appears to rely on the standard FTRL potential; it is unclear whether the laziness slack k enters the analysis in a way that genuinely decouples from P_T or whether the bound reduces tautologically once k is set to Θ(√(T/P_T)).
Authors: In the FTRL analysis of §3 the algorithm performs a regularized update only every k steps, accumulating the gradient over the slack interval. The movement cost is therefore incurred only at update epochs, and the potential is applied to the summed gradient. The condition k ≤ Θ(√(T/P_T)) ensures that the comparator variation over each slack interval remains controlled relative to the total path length P_T; for larger k the tracking error grows. The bound is not tautological—the range on k is derived from balancing the hitting-cost and movement-cost terms. We will expand the proof with an explicit intermediate inequality showing the dependence on k. revision: yes
Circularity Check
Standard FTRL analysis with no reduction to fitted inputs or self-citation load-bearing
full rationale
The paper derives the O(sqrt((P_T+1)T)) dynamic regret bound for k-lazyGD by direct analysis within the standard Follow the Regularized Leader (FTRL) framework, as stated in the abstract. This is an external, well-established template in online convex optimization and does not reduce the claimed result to a quantity fitted from the target data by construction. The ensemble over slacks is invoked to handle unknown P_T, but the provided text gives no equations showing the meta-regret collapsing to the base bound or to a self-referential fit. No self-citation chains, ansatz smuggling, or renaming of known empirical patterns appear in the load-bearing steps. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- k
axioms (1)
- domain assumption Standard assumptions of smoothed online convex optimization including hitting and movement costs
Reference graph
Works this paper leans on
-
[1]
Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,
Tuo Liu, El Mehdi Saad, Wojciech Kot lowski, and Francesco Orabona. Dual averaging con- verges for nonconvex smooth stochastic optimiza- tion.arXiv:2505.21394,
-
[2]
Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J Weinberger. On sequential strategies for loss functions with memory.IEEE Transactions on Information Theory, 48(7):1947–1958,
work page 1947
-
[3]
A Modern Introduction to Online Learning
Francesco Orabona. A Modern Introduction to Online Learning.arXiv.1912.13213,
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[4]
Input: compact setX, regularizationσ >0, laziness slackk∈N, horizonT. Output:{x t}T t=1. 1:Initializex 1 ∈ X, setp 1:0 .=0. 2:fort= 1,2, . . . , Tdo 3:Use actionx t. 4:f t(·)is revealed 5:Compute a subgradientg t ∈∂f t(xt). 6:Set the countern t .= (t−1) mod (k) 7:Compute the unconstrained FTRL centery t =− 1 σ p1:t−1. 8:Set the normal-cone correctiong I t...
work page 2017
-
[5]
Gaussian gradients with variance 10 per coordinate
We generatePphases with alternating mean directions µp ∈ {+1,−1}(starting positive), and sample within each phase i.i.d. Gaussian gradients with variance 10 per coordinate. Each sampled row is thenℓ 2-normalized (i.e., projected Gaussian). Concatenating thePphases yields a horizon of lengthT tot .=P T phase. This construction is interesting because it is ...
work page 2022
-
[6]
We consider linear costs ft(x) .=⟨g t,x⟩,g t ∈[−1,1] 5,∥g t∥2 = 1, with decisions constrained to theℓ 1 unit ball X .={x∈R 5 :∥x∥ 1 ≤1}. Sequence generation.In each phase, gradients are predominantly aligned with one sign for long stretches (here of length 100), but are interspersed with short bursts of the opposite sign (length 10). The phase length is 2...
work page 2000
-
[7]
Moreover, the losses within an active interval differ slightly for the learner and the comparator
Moreover, since the sequence starts with the positive half, the sum over any window is non-negative: jX t=1 gt ≥0,∀j∈[T]. Moreover, the losses within an active interval differ slightly for the learner and the comparator. Specifically, the learner incurs− k−1 2 , while the comparator incurs−(k−1), since the final step contributes zero cost. Hence, the regr...
work page 2012
-
[8]
Moreover, by properties of Euclidean projection, the switching cost remains bounded in theℓ 2 norm, which in turn controls all other norms by equivalence of norms. F.2 Relation to other Phased Updates The phased structure of the update rule ink-lazyGDis reminiscent of blocking methods, which also operate in phases (Merhav et al., 2002; Chen et al., 2020)....
work page 2002
-
[9]
The inequality holds because of the update rule for eachx t+1 for anyt
! = TX t=1 h0:t(xt)− TX t=1 h0:t(xt+1)− T−1X t=1 h0:t(ut)− T−1X t=1 h0:t(ut+1 ! +r(u 1). The inequality holds because of the update rule for eachx t+1 for anyt. I.e., h0:T (xT+1 )≤h 0:T (uT ). Also,h 0(u1) =r(u 1). Writingh t of the LHS explicitly, and adding the total switching cost to both sides, we get TX t=1 ⟨pt,x t −u t⟩+ T−1X t=1 ∥xt −x t−1∥ ≤ TX t=...
work page 2017
-
[10]
and hence (24) becomes 0∈ N X (xt), which always holds. In both cases,x t satisfies the optimality condition for minimizingh 0:t−1(x) +⟨g I t ,x⟩overX, and is therefore a valid minimizer. G.2 Proof of Lemma7 Lemma 7Under the Lipschitzness of the loss functions and compactness of the domain,k-lazyGDiterates guarantee the following bounds on each part of th...
work page 2007
-
[11]
Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T
For any sequence of convex losses and comparators, the dynamic regret with switching cost of k-lazyGDsatisfies RT ≤ TX t=1 min Gδt, G2 2σ + (2Rσ+kG)P T + σR2 2 + TX t=1 min(δt, G σ ), whereδ t .=∥x t+1 −xt∥. Moreover, settingk=⌊c p T /PT ⌋, c >0 andσ=σ ⋆ .= p (G2+2G)T /(4RP T +R2) gives RT =O p (PT + 1)T . Proof.Starting from the result of Lemma 6, and su...
work page 2021
-
[12]
For any comparator sequence, running the meta-learner of (Zhang et al., 2021, “SAder”) over a set ofk-lazyGDexperts fromH, guarantees RT =O p (PT + 1)T . Proof.(ofTheorem
work page 2021
-
[13]
2 and hence ln(1/wσs 1 )≤2 ln(s+ 1)≤2 ln(log( p 1 + 6PT/R+ 1)) I Additional Details on Laziness Advantages I.1 Variance-based Switching Cost Beyond the stability properties formalized in Propositions 3 and 4, laziness also yields a variance-based switching bound. unlike those results, this bound requires no additional assumptions on the domain or on the m...
work page 2010
-
[14]
Remarks.For the fully lazy case (k=T), the result shows that switching can be bounded as Θ ∥gt − 1 t g1:t∥+ 1 t , namely in terms of the deviation of the most recent gradient from the running mean, rather than its raw magnitude. In thek-lazy case, the same principle applies phase by phase, with the bound expressed relative to the average within each lazy ...
work page 2010
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.