pith. sign in

arxiv: 2511.07767 · v2 · submitted 2025-11-11 · 💻 cs.LG

Taking the Road Less Scheduled with Adaptive Polyak Steps

Pith reviewed 2026-05-17 23:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords schedule-free optimizationPolyak step sizesadaptive learning rateslast-iterate convergenceconvex optimizationSGDAdam
0
0 comments X

The pith

Polyak step sizes for schedule-free SGD and Adam compute learning rates from loss and gradient values to achieve optimal convergence without knowing the training horizon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives Polyak-type step sizes that set the learning rate at each step using only the current loss, gradient, and iterate for both schedule-free SGD and Adam. An oracle version that knows per-sample optimal function values is shown to deliver an O(1/sqrt(t)) last-iterate rate on convex Lipschitz problems. A practical safeguarded version replaces those optimal values with any available lower bound and still reaches the same rate up to a small neighborhood that disappears when the problem satisfies interpolation. Both variants reduce to the usual Polyak rules for ordinary SGD when momentum is turned off. This matters because it removes the need to choose a base learning rate whose best value depends on unknown problem constants or to pick a schedule in advance.

Core claim

We derive Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. The oracle variant uses per-sample optimal function values and proves an O(1/sqrt(t)) anytime last-iterate rate for convex Lipschitz objectives. The safeguarded variant replaces the unknown optimal values with any available lower bound, achieving the same rate up to a neighborhood that vanishes under interpolation. Both step sizes reduce to existing Polyak rules for standard SGD when momentum is set to zero.

What carries the argument

The safeguarded Polyak step size, which replaces unknown per-sample optimal function values with any available lower bound on the optimal objective value while preserving the convergence guarantee.

If this is right

  • The new step sizes deliver the optimal anytime last-iterate rate without requiring a pre-specified training horizon or schedule.
  • Performance remains comparable to well-tuned schedule-free baselines on language modeling tasks while requiring less hyperparameter adjustment.
  • Setting momentum to zero recovers the classical Polyak step-size rule for ordinary SGD, providing a single framework for both scheduled and schedule-free cases.
  • The neighborhood term vanishes automatically under the interpolation condition common in over-parameterized models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction could be tested on non-convex deep-learning objectives to see whether last-iterate stability improves in practice.
  • Because the step size depends only on per-sample quantities, it may combine naturally with variance-reduction techniques that already track individual losses.
  • The unification with standard Polyak rules suggests that existing convergence proofs for Polyak SGD could be reused with only minor changes for the schedule-free setting.

Load-bearing premise

The objective must be convex and Lipschitz continuous, and a usable lower bound on the optimal value must be supplied for the safeguarded version.

What would settle it

Run the safeguarded method on a convex Lipschitz problem that does not satisfy interpolation and check whether the neighborhood around the optimum fails to shrink to zero as iterations increase.

Figures

Figures reproduced from arXiv: 2511.07767 by Dimitris Oikonomou, Matthew Buchholz, Nicolas Loizou, Robert M. Gower, Yuen-Man Pun.

Figure 1
Figure 1. Figure 1: Our theory (Theorem 2.1) is good at predicting the behavior of the training loss: The [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Training loss for schedule-free on ResNet-20 /Cifar10 with a constant learning rate schedule (gray), warmup-stable (red￾gray), and wsd schedule (red-gray-blue). The schedule-free algorithm was designed to per￾form well without the need to tune additional hyper￾parameters beyond momentum. For convex and Lip￾schitz objectives, it achieves the optimal convergence guarantees with a constant step size. Despite … view at source ↗
Figure 3
Figure 3. Figure 3: Using wsd schedules with three different cooldown periods and with base learning rate γ = 0.01, our plots compare the theoretical convergence (Theorem 2.1) to the empirical convergence of ResNet-20/Cifar10, with the gradient norm shown for reference. The red color denotes the warmup period, the gray color denotes the constant period, and the blue color denotes the cooldown period. We take x⋆ to be the iter… view at source ↗
Figure 4
Figure 4. Figure 4: The averaging parameter ct when applied with the wsd schedule where blue is our proposed ct = ηt/ Pt i=0 ηi , gray is ct = 1/t, and the orange is the practical heuristic ct = η 2 t / Pt i=0 η 2 i . Substituting the wsd schedule (8) into (6), we can obtain a sequence of averaging parameters. Lemma 2.2. Let 0 ≤ Tw ≤ Tc ≤ T and γ > 0. Suppose that {ηt} T t=0 follows the wsd schedule given in (8). We can deter… view at source ↗
Figure 5
Figure 5. Figure 5: Using schedules with three three different diverging periods, we compare the theoretical [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training a Wide ResNet (16-8) model on the CIFAR10 data set. The results in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training a smaller student model on the tiny shakespeare data set, using gpt2-medium as the teacher. 10 4 10 3 10 2 10 1 Learning Rate 3 4 5 6 7 8 9 10 Final Validation Loss adamw-schedulep adamw-schedulefree iams-adam adamw teacher 0.0 0.2 0.4 0.6 0.8 1.0 Epochs 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 Loss adamw-schedulep lr=0.0010 adamw-schedulefree lr=0.0010 iams-adam lr=0.0010 adamw lr=0.0050 [PITH_FULL_IMAGE… view at source ↗
Figure 8
Figure 8. Figure 8: Training a nanoGPT student model on the fineweb1B data set, using EleutherAI/gpt-j-6B as the teacher. learning rate γmax, but it is not quite as robust as the IAMS-Adam method is to the choice of learning rate. Distilling fineweb1B. The teacher model employed was EleutherAI/gpt-j-6B, a 6- billion parameter transformer model pre-trained on diverse datasets (Wang & Komatsuzaki, 2021). We used a nanoGPT model… view at source ↗
Figure 9
Figure 9. Figure 9: Comparison between the convex theory and the training loss for [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Training a DenseNet model on the CIFAR100 data set [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
read the original abstract

Schedule-Free SGD, proposed in [Defazio et al., 2024], achieves optimal convergence rates without requiring the training horizon in advance, by replacing learning rate schedules with a principled form of iterate averaging. However, the method still requires tuning a base learning rate whose optimal value depends on unknown problem constants. In this work, we continue down this road by deriving Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. We first propose an oracle variant that uses per-sample optimal function values and prove an $O(1/\sqrt{t})$ anytime last-iterate rate for convex Lipschitz objectives. We then remove the oracle requirement with a safeguarded variant that replaces the unknown optimal values with any available lower bound, achieving the same rate up to a neighborhood that vanishes under interpolation. Both step sizes reduce to existing Polyak rules for standard SGD when momentum is set to zero, unifying standard and schedule-free Polyak methods. Numerical experiments on language modeling, including pretraining and distillation, show that the proposed methods match or surpass tuned Schedule-Free baselines while offering greater robustness to hyperparameter choices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes adaptive Polyak-type step sizes for Schedule-Free SGD and Adam that are computed from per-iteration loss, gradient, and iterate information. An oracle variant using exact per-sample optimal values f_i^* is shown to achieve an O(1/sqrt(t)) anytime last-iterate rate for convex Lipschitz objectives. A safeguarded variant replaces f_i^* with any lower bound L_i and retains the same rate up to an additive neighborhood whose size is controlled by (f_i(x_t) - L_i) and vanishes under interpolation. Both variants reduce to classical Polyak steps for standard SGD when momentum is zero. Experiments on language modeling pretraining and distillation indicate that the methods match or exceed tuned Schedule-Free baselines with improved hyperparameter robustness.

Significance. If the proofs are complete, the work supplies a parameter-free adaptive mechanism that removes the need to tune a base learning rate in schedule-free frameworks while preserving optimal rates and unifying them with existing Polyak methods. The last-iterate guarantees and the explicit handling of lower bounds are technically useful; the empirical robustness on large-scale language tasks adds practical value. Strengths include the machine-checked-style reduction to prior Polyak rules and the falsifiable rate claims under standard convexity/Lipschitz assumptions.

major comments (2)
  1. [§3.2] §3.2 (safeguarded variant convergence): the O(1/sqrt(t)) last-iterate bound is stated up to a neighborhood controlled by (f_i(x_t) - L_i). The proof sketch bounds this residual but does not explicitly demonstrate that the adaptive steps drive f_i(x_t) to L_i at a rate compatible with the overall convergence when interpolation holds; without an additional contraction argument or lemma showing the extra term vanishes asymptotically, the 'vanishes under interpolation' claim does not follow directly from the given rate.
  2. [Theorem 1] Theorem 1 (oracle variant): the reduction to existing Polyak rules when momentum is zero is claimed, but the derivation of the schedule-free update with the Polyak step inserted should be expanded to show that no hidden constants or additional assumptions are introduced by the averaging mechanism.
minor comments (2)
  1. [§3.1] Notation for the lower bound L_i is introduced without a dedicated paragraph on practical selection when the model is not interpolating; a short discussion or heuristic would improve clarity.
  2. [Experiments] Figure 2 (or equivalent experimental plot): axis labels and legend entries for the different Polyak variants are slightly compressed; increasing font size or adding a caption table would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We address each major comment below and will revise the manuscript to strengthen the relevant sections.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (safeguarded variant convergence): the O(1/sqrt(t)) last-iterate bound is stated up to a neighborhood controlled by (f_i(x_t) - L_i). The proof sketch bounds this residual but does not explicitly demonstrate that the adaptive steps drive f_i(x_t) to L_i at a rate compatible with the overall convergence when interpolation holds; without an additional contraction argument or lemma showing the extra term vanishes asymptotically, the 'vanishes under interpolation' claim does not follow directly from the given rate.

    Authors: We agree that the current proof sketch would be strengthened by an explicit argument showing the residual vanishes asymptotically under interpolation. We will add a supporting lemma that establishes a contraction on the per-sample residuals driven by the adaptive Polyak steps, ensuring compatibility with the overall O(1/sqrt(t)) rate. This revision will make the claim fully rigorous. revision: yes

  2. Referee: [Theorem 1] Theorem 1 (oracle variant): the reduction to existing Polyak rules when momentum is zero is claimed, but the derivation of the schedule-free update with the Polyak step inserted should be expanded to show that no hidden constants or additional assumptions are introduced by the averaging mechanism.

    Authors: We appreciate the request for greater clarity. While the reduction follows by direct substitution of zero momentum into the schedule-free recurrence, we will expand the derivation (in the main text or appendix) with an explicit step-by-step expansion of the update rule. This will confirm that the averaging introduces neither hidden constants nor extra assumptions beyond the standard Polyak analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines adaptive Polyak-type step sizes directly from per-iteration sampled loss, gradient norm, and current iterate values (oracle variant using exact per-sample f_i^*; safeguarded variant using any lower bound L_i). It then proves an O(1/sqrt(t)) last-iterate rate for convex Lipschitz objectives under standard assumptions, with the neighborhood term vanishing under interpolation by the analysis. Both variants reduce to known Polyak rules for plain SGD at zero momentum. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claim are present. The derivation is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard convex optimization assumptions and the Schedule-Free framework from prior work; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Objectives are convex and Lipschitz continuous
    Invoked to obtain the O(1/sqrt(t)) last-iterate rate.
  • domain assumption A lower bound on the optimal value is available for the safeguarded variant
    Used to remove the oracle requirement while preserving the rate up to a vanishing neighborhood.

pith-pipeline@v0.9.0 · 5518 in / 1231 out tokens · 33324 ms · 2026-05-17T23:14:29.210348+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We derive our adaptive learning rate by choosing γ_t that will bring iterate z_t closer to the solution x*. ... Minimizing this upper bound in γ_t gives our adaptive learning rate ... schedulep ... E[f(x_t)-f(x*)] ≤ G||x0-x*|| / √(t+1)

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Analysis of schedule-free nonconvex optimization.arXiv preprint arXiv:2508.06743,

    Connor Brown. Analysis of schedule-free nonconvex optimization.arXiv preprint arXiv:2508.06743,

  2. [2]

    arXiv preprint

    URLhttps://arxiv.org/abs/ 2310.07831. arXiv preprint. Aaron Defazio, Xingyu Yang, Ahmed Khaled, Konstantin Mishchenko, Harsh Mehta, and Ashok Cutkosky. The road less scheduled.Advances in Neural Information Processing Systems, 37: 9974–10007,

  3. [3]

    Hazan and S

    Elad Hazan and Sham Kakade. Revisiting the polyak step size.arXiv preprint arXiv:1905.00313,

  4. [4]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980,

  5. [5]

    Stochastic polyak step- size for sgd: An adaptive learning rate for fast convergence.arXiv:2002.10542,

    Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step- size for sgd: An adaptive learning rate for fast convergence.arXiv:2002.10542,

  6. [6]

    Oikonomou and N

    Dimitris Oikonomou and Nicolas Loizou. Stochastic Polyak step-sizes and momentum: Conver- gence guarantees and practical performance.arXiv preprint arXiv:2406.04142,

  7. [7]

    2 1.2 Contributions and Background

    11 CONTENTS 1 Introduction 1 1.1schedule-freeSGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Contributions and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Convergence Analysis and Implications 4 2.1 Surprising Predictive Power for Deep Learning . . . . . . . . . . . . . . . . . . . 4 2.2 Application tow...

  8. [8]

    TX t=1 wt⟨gt,z t −x ⋆⟩ # =E

    = T 3 − Tc 3 + 1.(31) 17 Combining, we have TX t=0 η2 t = 2Tw + 1 6 +T c −T w + T 3 − Tc 3 + 1 = T+ 2T c −2T w 3 + 7 6 ≤ 2 3 (T+T c −T w + 2).(32) Applying the results to Theorem 2.1 then establishes the corollary. A.5 COMMENTS ON THEWEIGHTSc t INDEFAZIO ET AL. (2024) Defazio et al. (2024) suggested the convergence rate ofO(1/ √ T)as long as the averaging...

  9. [9]

    As mentioned in Section 4, this is to avoid the complication of writing custom BatchNorm code to approximate batch statistics of thexsequence ofSchedule-free

    All experiments are based on the open-source frameworkhttps://github.com/fabian-sp/ step-back, which we extend to include theSchedule-freeoptimizer and to support Group- Norm normalization layers rather than BatchNorm for theResNet 8 andDenseNet 9 architec- tures. As mentioned in Section 4, this is to avoid the complication of writing custom BatchNorm cod...

  10. [10]

    We usewsdschedule forSGD-m,scheduletandSchedule-freewith ct = 1/tfrom previous theory, and usewamrup-stableschedule only forSchedule-freewith the heuristic parametersc t =γ 2 t /Pt i=1 γ2 i . Figure 6 shows the training performance (in terms of the training loss and the validation score) against the learning rate or the number of epochs when training aWid...