Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
Pith reviewed 2026-05-21 13:12 UTC · model grok-4.3
The pith
Generalized smoothness makes warm-up and decay the natural schedule for norm-constrained optimizers
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training.
What carries the argument
Generalized smoothness assumption, which states that local curvature decreases with the suboptimality gap and is used to derive a learning rate schedule with natural warm-up and decay phases.
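The shape of such a schedule can be sketched numerically. The snippet below is a minimal illustration, not the paper's scheduler: it takes the step-size form quoted further down this page, η(Δ) = Δ / (K0 + K1 Δ + K2 Δ²), assumes Δ denotes the suboptimality gap, and assumes the gap shrinks geometrically during training. The constants `K0`, `K1`, `K2` and the gap trajectory are hypothetical values chosen only to make the rise and fall visible.

```python
import math


def eta(gap, k0, k1, k2):
    """Step size eta(gap) = gap / (k0 + k1*gap + k2*gap**2)."""
    return gap / (k0 + k1 * gap + k2 * gap ** 2)


def peak_gap(k0, k2):
    """Gap at which eta is largest: the derivative vanishes at sqrt(k0 / k2)."""
    return math.sqrt(k0 / k2)


# Hypothetical constants, for illustration only.
K0, K1, K2 = 1.0, 0.5, 0.25

# Assumed trajectory: the gap halves at every step, starting from 16.
gaps = [16.0 * 0.5 ** t for t in range(10)]
rates = [eta(g, K0, K1, K2) for g in gaps]

# Index of the largest learning rate along the assumed trajectory.
peak_step = max(range(len(rates)), key=rates.__getitem__)

for t, (g, r) in enumerate(zip(gaps, rates)):
    phase = "warm-up" if t < peak_step else "decay"
    print(f"step {t}: gap={g:.4f} eta={r:.4f} ({phase})")
```

With these assumed values the rate rises while the gap is above `peak_gap(K0, K2)` and falls once the gap drops below it, which is the warm-up-then-decay pattern described above. In practice the gap is not observable directly, so a real scheduler would need some proxy for it.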
If this is right
- Convergence is guaranteed for the derived learning rate schedule under the assumption.
- The scheduler automatically determines warm-up length at training start.
- It performs at least as well as the best manually tuned warm-up schedules on LLM pretraining with LLaMA architectures.
- Only standard hyperparameters are needed, eliminating extra search for warm-up.
Where Pith is reading between the lines
- This could explain the empirical success of warm-up in many deep learning training runs if the assumption holds broadly.
- The adaptive scheduler might be tested on other model architectures or tasks to see if automatic warm-up selection generalizes.
- If the curvature behavior is verified in more settings, it could guide design of other adaptive methods.
Load-bearing premise
Local curvature decreases with the suboptimality gap.
What would settle it
If measurements along training show that the local curvature does not decrease as the suboptimality gap shrinks, the central assumption would be falsified.
Original abstract
We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup
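For readers unfamiliar with the optimizer family, the following is a minimal sketch of a sign-momentum step in the style of Lion. It is not taken from the paper or its repository: the function name `lion_style_step`, the default coefficients, and the toy inputs are assumptions for illustration, and weight decay is omitted. It shows the property that makes such methods "norm-constrained": every coordinate moves by exactly `lr`, so the step length in the max norm is set by the learning rate rather than by the gradient magnitude.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_style_step(params, grads, momentum, lr, beta1=0.9, beta2=0.99):
    """One sign-momentum step on plain Python lists (weight decay omitted)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Direction is the sign of an interpolation between momentum and gradient.
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * direction)
        # Momentum is updated with a separate coefficient after the step.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Toy inputs (hypothetical): gradients differ by orders of magnitude.
params = [1.0, -2.0, 0.5]
grads = [0.3, -4.0, 1e-3]
momentum = [0.0, 0.0, 0.0]

new_params, new_momentum = lion_style_step(params, grads, momentum, lr=0.1)
step_sizes = [abs(a - b) for a, b in zip(new_params, params)]
print(step_sizes)
```

Because the step length is fixed by `lr` alone, the learning rate schedule carries all of the step-size control, which is one reason the choice of warm-up matters for this family.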
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a generalized smoothness assumption under which local curvature decreases with the suboptimality gap. Under this assumption, convergence guarantees are derived for norm-constrained optimizers (e.g., Muon, Lion), from which a learning-rate schedule with natural warm-up followed by decay emerges. A practical adaptive scheduler is then constructed that automatically selects warm-up duration using only standard hyperparameters. The scheduler is evaluated on LLaMA-based LLM pretraining and reported to match or exceed the best manually tuned warm-up schedules without extra hyperparameter search.
Significance. If the generalized smoothness assumption is valid in the relevant regimes and the derivation is tight, the work supplies a theoretical origin for warm-up in norm-constrained methods and a reproducible adaptive rule that reduces manual tuning. The open-source implementation and empirical results on large-scale pretraining add practical value. The contribution is tempered by the load-bearing role of the assumption, whose empirical support is trajectory-based rather than global.
major comments (2)
- [Abstract and theoretical analysis section] The convergence theorem relies on the generalized smoothness assumption holding with sufficient strength to control the step-size schedule throughout training. The reported empirical checks verify the curvature-suboptimality relation only along observed trajectories; this does not automatically guarantee the uniform bounds needed for the global guarantees, especially under large-batch LLM pretraining regimes.
- [Section deriving the adaptive scheduler, around the practical algorithm] The claim that the scheduler 'relies only on standard hyperparameters' is not accompanied by a sensitivity analysis showing that the automatic warm-up adaptation remains stable when those hyperparameters vary within typical ranges used in LLM training.
minor comments (2)
- [Empirical verification figure] Figure showing curvature vs. suboptimality gap: adding shaded regions for multiple runs or seeds would clarify the consistency of the observed relation.
- [Preliminaries] Notation for the norm-constrained update rules: a short table comparing the update forms of Muon, Lion, and Adam would aid readers unfamiliar with the family.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate the planned revisions.
- Referee: [Abstract and theoretical analysis section] The convergence theorem relies on the generalized smoothness assumption holding with sufficient strength to control the step-size schedule throughout training. The reported empirical checks verify the curvature-suboptimality relation only along observed trajectories; this does not automatically guarantee the uniform bounds needed for the global guarantees, especially under large-batch LLM pretraining regimes.
Authors: We agree that the empirical verification is performed along observed trajectories and does not establish a priori uniform bounds. The convergence theorem is stated conditionally on the generalized smoothness assumption holding with the required strength. In the revision we will add explicit language in the theoretical section clarifying this scope and discussing the empirical support specifically in large-batch regimes, while noting that the practical scheduler is further validated by direct experiments. revision: partial
- Referee: [Section deriving the adaptive scheduler, around the practical algorithm] The claim that the scheduler 'relies only on standard hyperparameters' is not accompanied by a sensitivity analysis showing that the automatic warm-up adaptation remains stable when those hyperparameters vary within typical ranges used in LLM training.
Authors: We accept that a sensitivity analysis is needed to substantiate the stability claim. We will add this analysis to the revised manuscript, evaluating the adaptive warm-up behavior under variations of the standard hyperparameters within the ranges commonly used for LLM pretraining. revision: yes
Circularity Check
Derivation self-contained under externally stated assumption
The paper introduces a generalized smoothness assumption (local curvature decreases with suboptimality gap) as an explicit modeling choice, empirically verifies the relation along observed trajectories, and derives convergence guarantees for norm-constrained optimizers under that assumption. The warm-up-then-decay schedule is shown to emerge from the structure of the proof rather than being fitted or defined in terms of the target performance. No load-bearing self-citations, self-definitional steps, or fitted inputs renamed as predictions are present; the central claim retains independent content from the stated assumption and analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: generalized smoothness, under which local curvature decreases with the suboptimality gap.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [echoes]
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap ... warm-up followed by decay arises naturally from the proof rather than being imposed heuristically.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration [echoes]
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
η(Δ) = Δ / (K0 + K1 Δ + K2 Δ²) ... transition occurs at the point Δ' obtained by maximizing η(Δ)
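The transition point in the quoted passage can be worked out directly from the stated form of η(Δ). The derivation below is a sketch under the assumption that K_0 and K_2 are positive constants; the excerpt does not define them, so their meaning here is inferred only from the formula.

```latex
\eta(\Delta) = \frac{\Delta}{K_0 + K_1\Delta + K_2\Delta^2},
\qquad
\eta'(\Delta) = \frac{K_0 - K_2\Delta^2}{\left(K_0 + K_1\Delta + K_2\Delta^2\right)^2}.

\eta'(\Delta') = 0
\;\Longrightarrow\;
\Delta' = \sqrt{\frac{K_0}{K_2}},
\qquad
\eta(\Delta') = \frac{1}{K_1 + 2\sqrt{K_0 K_2}}.
```

Since η'(Δ) is negative for Δ > Δ' and positive for Δ < Δ', a gap that shrinks from above Δ' first increases η (warm-up) and then decreases it (decay).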
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.