Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers

Aleksandr Beznosikov; Andrey Veprikov; Arman Bolatov; Artem Riabinin; Martin Takáč

arxiv: 2602.05813 · v2 · pith:NQFMLTUXnew · submitted 2026-02-05 · 💻 cs.LG · math.OC

Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov

Pith reviewed 2026-05-21 13:12 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords: adaptive learning rate scheduling · warm-up · norm-constrained optimizers · generalized smoothness · convergence guarantees · LLM pretraining · Muon · Lion

The pith

Generalized smoothness makes warm-up and decay the natural schedule for norm-constrained optimizers

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks where warm-up comes from in learning rate scheduling for norm-constrained optimizers such as Muon and Lion. It introduces and empirically verifies a generalized smoothness assumption under which local curvature decreases with the suboptimality gap. Under this assumption, it proves convergence guarantees for a learning rate choice in which warm-up followed by decay emerges directly from the proof. It then develops a practical scheduler that adapts the warm-up duration automatically using only standard hyperparameters. On LLaMA pretraining, this adaptive method consistently matches or exceeds the best manually tuned warm-up schedules without additional hyperparameter search.

Core claim

We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training.

What carries the argument

The generalized smoothness assumption, which states that local curvature decreases with the suboptimality gap. It is used to derive a learning rate schedule with natural warm-up and decay phases.

If this is right

Convergence is guaranteed for the derived learning rate schedule under the assumption.
The scheduler automatically determines warm-up length at training start.
It matches or outperforms the best manually tuned warm-up schedules on LLaMA pretraining.
Only standard hyperparameters are needed, eliminating extra search for warm-up.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the assumption holds broadly, this could explain why warm-up works so well empirically across deep learning training.
The adaptive scheduler might be tested on other model architectures or tasks to see if automatic warm-up selection generalizes.
If the curvature behavior is verified in more settings, it could guide design of other adaptive methods.

Load-bearing premise

Local curvature decreases with the suboptimality gap.

What would settle it

If measurements along training show that the local curvature does not decrease as the suboptimality gap shrinks, the central assumption would be falsified.
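
A toy sketch of what such a measurement could look like. This is not the paper's measurement procedure: the objective f(x) = sum(x_i^4), the plain gradient-descent trajectory, and the secant-based curvature estimate ||g(x_next) - g(x)|| / ||x_next - x|| are all assumptions chosen for illustration, and the function names are hypothetical. On this toy objective the minimum value is known to be 0, so the suboptimality gap can be computed exactly, which is generally not the case in LLM pretraining.

```python
import numpy as np

def f(x):
    # Toy objective with minimum value 0 at x = 0 (an assumption for illustration).
    return float(np.sum(x ** 4))

def grad(x):
    return 4.0 * x ** 3

def curvature_vs_gap(x0, lr=0.01, steps=200, f_star=0.0):
    """Run plain gradient descent and record, at every step, the suboptimality
    gap f(x) - f_star and a secant estimate of local curvature."""
    x = np.asarray(x0, dtype=float)
    gaps, curvs = [], []
    for _ in range(steps):
        g = grad(x)
        x_next = x - lr * g
        step_norm = np.linalg.norm(x_next - x)
        if step_norm == 0.0:
            break  # no movement, the secant estimate is undefined
        # Secant estimate: how fast the gradient changes along the step taken.
        curvs.append(np.linalg.norm(grad(x_next) - g) / step_norm)
        gaps.append(f(x) - f_star)
        x = x_next
    return np.array(gaps), np.array(curvs)

gaps, curvs = curvature_vs_gap([2.0, -1.5])
print(f"start: gap={gaps[0]:.3f}, curvature={curvs[0]:.3f}")
print(f"end:   gap={gaps[-1]:.5f}, curvature={curvs[-1]:.3f}")
```

On this objective the estimated curvature falls as the gap shrinks, which is the qualitative pattern the assumption requires. A trajectory on which the estimate stayed flat or rose while the gap shrank would be the kind of evidence that counts against it.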

Original abstract

We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives an adaptive warm-up schedule for norm-constrained optimizers from a new generalized smoothness assumption and shows competitive results on LLaMA pretraining.

The main point is that, under a generalized smoothness assumption where local curvature decreases with the suboptimality gap, a warm-up-then-decay schedule emerges naturally from the convergence analysis for optimizers like Muon and Lion instead of being added by hand. The authors verify that the assumption holds along observed trajectories and turn the theory into a practical scheduler that sets the warm-up length automatically from standard hyperparameters. On LLaMA pretraining it matches or beats the best manual schedules across setups without extra search, and the code is released, which helps check the claims directly. That combination of derivation plus usable method is the useful part here.

The assumption is the load-bearing piece. The empirical checks along trajectories give some support, but the proof needs the relation to control the schedule globally throughout training. If the curvature-suboptimality link is only approximate, or fails in the early phases or in the large-batch regimes common in LLM work, the theoretical justification thins out and the scheduler becomes more of a working heuristic. The abstract leaves the exact strength of the bounds unclear.

This is aimed at people tuning optimizers for large models who want to cut down on schedule search. A reader focused on adaptive methods or norm-constrained training would pick up the experiments and the released code. With a new assumption, a derivation, and relevant results, it deserves referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a generalized smoothness assumption under which local curvature decreases with the suboptimality gap. Under this assumption, convergence guarantees are derived for norm-constrained optimizers (e.g., Muon, Lion), from which a learning-rate schedule with natural warm-up followed by decay emerges. A practical adaptive scheduler is then constructed that automatically selects warm-up duration using only standard hyperparameters. The scheduler is evaluated on LLaMA-based LLM pretraining and reported to match or exceed the best manually tuned warm-up schedules without extra hyperparameter search.

Significance. If the generalized smoothness assumption is valid in the relevant regimes and the derivation is tight, the work supplies a theoretical origin for warm-up in norm-constrained methods and a reproducible adaptive rule that reduces manual tuning. The open-source implementation and empirical results on large-scale pretraining add practical value. The contribution is tempered by the load-bearing role of the assumption, whose empirical support is trajectory-based rather than global.

major comments (2)

[Abstract and theoretical analysis section] The convergence theorem relies on the generalized smoothness assumption holding with sufficient strength to control the step-size schedule throughout training. The reported empirical checks verify the curvature-suboptimality relation only along observed trajectories; this does not automatically guarantee the uniform bounds needed for the global guarantees, especially under large-batch LLM pretraining regimes.
[Section deriving the adaptive scheduler, around the practical algorithm] The claim that the scheduler 'relies only on standard hyperparameters' is not accompanied by a sensitivity analysis showing that the automatic warm-up adaptation remains stable when those hyperparameters vary within typical ranges used in LLM training.

minor comments (2)

[Empirical verification figure] Figure showing curvature vs. suboptimality gap: adding shaded regions for multiple runs or seeds would clarify the consistency of the observed relation.
[Preliminaries] Notation for the norm-constrained update rules: a short table comparing the update forms of Muon, Lion, and Adam would aid readers unfamiliar with the family.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the planned revisions.

Referee: [Abstract and theoretical analysis section] The convergence theorem relies on the generalized smoothness assumption holding with sufficient strength to control the step-size schedule throughout training. The reported empirical checks verify the curvature-suboptimality relation only along observed trajectories; this does not automatically guarantee the uniform bounds needed for the global guarantees, especially under large-batch LLM pretraining regimes.

Authors: We agree that the empirical verification is performed along observed trajectories and does not establish a priori uniform bounds. The convergence theorem is stated conditionally on the generalized smoothness assumption holding with the required strength. In the revision we will add explicit language in the theoretical section clarifying this scope and discussing the empirical support specifically in large-batch regimes, while noting that the practical scheduler is further validated by direct experiments.

revision: partial

Referee: [Section deriving the adaptive scheduler, around the practical algorithm] The claim that the scheduler 'relies only on standard hyperparameters' is not accompanied by a sensitivity analysis showing that the automatic warm-up adaptation remains stable when those hyperparameters vary within typical ranges used in LLM training.

Authors: We accept that a sensitivity analysis is needed to substantiate the stability claim. We will add this analysis to the revised manuscript, evaluating the adaptive warm-up behavior under variations of the standard hyperparameters within the ranges commonly used for LLM pretraining.

revision: yes

Circularity Check

0 steps flagged

Derivation self-contained under externally stated assumption

The paper introduces a generalized smoothness assumption (local curvature decreases with suboptimality gap) as an explicit modeling choice, empirically verifies the relation along observed trajectories, and derives convergence guarantees for norm-constrained optimizers under that assumption. The warm-up-then-decay schedule is shown to emerge from the structure of the proof rather than being fitted or defined in terms of the target performance. No load-bearing self-citations, self-definitional steps, or fitted inputs renamed as predictions are present; the central claim retains independent content from the stated assumption and analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The load-bearing element is the generalized smoothness assumption itself; no free parameters or invented entities are mentioned in the abstract. The assumption is treated as a domain assumption that is empirically verified along trajectories.

axioms (1)

domain assumption: Generalized smoothness assumption under which local curvature decreases with the suboptimality gap.
Invoked to establish convergence guarantees and to derive the natural emergence of warm-up followed by decay.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap ... warm-up followed by decay arises naturally from the proof rather than being imposed heuristically.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · echoes

η(Δ) = Δ / (K0 + K1 Δ + K2 Δ²) ... transition occurs at the point Δ' obtained by maximizing η(Δ)
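
The quoted step-size rule can be explored numerically. Assuming K0 > 0 and K2 > 0, differentiating gives dη/dΔ = (K0 - K2 Δ²) / (K0 + K1 Δ + K2 Δ²)², so the maximizer is Δ' = sqrt(K0 / K2) with peak value η(Δ') = 1 / (K1 + 2 sqrt(K0 K2)). The sketch below is not the paper's scheduler: the constants, the geometrically shrinking gap sequence, and the function names are hypothetical, and in practice the gap is not directly observable. It only shows that feeding a shrinking gap through η(Δ) produces a rate that first rises and then falls.

```python
import math

def eta(delta, k0, k1, k2):
    """Step size as a function of the suboptimality gap, as quoted above."""
    return delta / (k0 + k1 * delta + k2 * delta ** 2)

def peak_gap(k0, k2):
    """Gap at which eta is maximal; requires k0 > 0 and k2 > 0."""
    return math.sqrt(k0 / k2)

# Hypothetical constants and a hypothetical, geometrically shrinking gap.
K0, K1, K2 = 1.0, 0.5, 4.0
gaps = [4.0 * 0.8 ** t for t in range(30)]

lrs = [eta(d, K0, K1, K2) for d in gaps]
peak_step = max(range(len(lrs)), key=lrs.__getitem__)

print(f"peak gap = {peak_gap(K0, K2):.3f}, "
      f"peak step size = {eta(peak_gap(K0, K2), K0, K1, K2):.4f}")
print(f"step size rises until step {peak_step}, then decays")
```

While the gap is above Δ', a shrinking gap raises the step size (the warm-up phase). Once the gap falls below Δ', further progress lowers it (the decay phase). The length of the warm-up therefore depends on how far the starting gap sits above Δ'.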

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Muon Does Not Converge on Convex Lipschitz Functions
cs.LG · 2026-05 · unverdicted · novelty 6.0

Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.