A Modern Introduction to Online Learning
Pith reviewed 2026-05-24 14:00 UTC · model grok-4.3
The pith
Online learning algorithms are presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader with emphasis on adaptive parameter tuning and unbounded domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All the algorithms are clearly presented as instantiation of Online Mirror Descent or Follow-The-Regularized-Leader and their variants, with particular attention to tuning parameters and learning in unbounded domains through adaptive and parameter-free algorithms.
What carries the argument
Online Mirror Descent and Follow-The-Regularized-Leader, which serve as the unifying frameworks for deriving and analyzing the algorithms.
If this is right
- A uniform analysis applies to first-order and second-order algorithms across Euclidean and non-Euclidean settings.
- Adaptive and parameter-free algorithms enable operation without prior knowledge of domain bounds.
- Non-convex losses can be addressed through convex surrogates and randomization.
- The framework extends to bandit problems, saddle-point optimization, and non-stationary regret analysis.
- Online learning techniques yield results in generalization theory and concentration inequalities.
Where Pith is reading between the lines
- The unified presentation may make it easier to transfer tuning techniques across different loss functions.
- Connections to statistical learning could be tested by applying the regret bounds directly to derive new concentration results.
Load-bearing premise
The proofs have been chosen to be simple and short, which sometimes requires adding one or two additional assumptions just to simplify them.
What would settle it
Finding an online learning algorithm that cannot be expressed as an instantiation of Online Mirror Descent or Follow-The-Regularized-Leader or their variants would challenge the book's central framing.
read the original abstract
In this book, I introduce the basic concepts of Online Learning through the modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiation of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are addressed through convex surrogate losses and randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. Finally, I also cover advanced topics, including black-box reductions, saddle-point optimization, sequential investment, and non-stationary forms of regret analysis. The book concludes with a selection of applications of online learning to domains far from it, such as generalization theory and concentration inequalities. I tried to maintain an informal, but mathematically serious, tone throughout the book. No prior knowledge of convex analysis is required. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible. This also means that sometimes I have added one or two additional assumptions, just to simplify the proofs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is an expository monograph introducing online learning via online convex optimization. It presents first- and second-order algorithms for convex losses in Euclidean and non-Euclidean geometries, uniformly framing them as instantiations of Online Mirror Descent (OMD) or Follow-The-Regularized-Leader (FTRL) and variants. Emphasis is placed on parameter tuning, adaptive and parameter-free methods for unbounded domains, convex surrogates for non-convex losses, the bandit setting, black-box reductions, saddle-point problems, sequential investment, non-stationary regret, and applications to generalization bounds and concentration inequalities. All proofs are selected to be short and elementary, sometimes at the cost of extra assumptions; no prior convex analysis is required.
Significance. If the derivations are accurate, the book offers a coherent modern synthesis that unifies disparate algorithms under OMD/FTRL, foregrounds practical issues of tuning and unbounded domains, and extends the material to applications outside core online learning. The explicit disclosure of proof simplifications and the informal-yet-rigorous tone are assets for accessibility.
minor comments (2)
- [Abstract] Abstract and introduction: the statement that 'sometimes I have added one or two additional assumptions, just to simplify the proofs' is appropriate, but each such assumption must be explicitly flagged at the point it is introduced (e.g., in the statement of each theorem or in a dedicated 'assumptions' paragraph) so readers can evaluate its necessity.
- Throughout: notation for regularizers, step-size schedules, and dual norms should be introduced once with a consolidated table or glossary; repeated re-definition of the same symbols across chapters risks confusion for readers new to the area.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the accurate summary of the manuscript's scope and tone, and the recommendation of minor revision. We are pleased that the unification under OMD/FTRL, the emphasis on tuning and unbounded domains, and the accessibility for readers without prior convex analysis background are viewed favorably.
Circularity Check
No significant circularity
full rationale
This is an expository monograph that presents standard algorithms as instantiations of Online Mirror Descent and Follow-The-Regularized-Leader without introducing new derivations, fitted parameters presented as predictions, or load-bearing self-citations. The abstract explicitly discloses the choice of simplified proofs with added assumptions, and the work relies on external prior literature rather than reducing any central claim to its own inputs by construction. No steps meet the criteria for circularity.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 34 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
-
Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays
Prudent-Banker achieves pseudo-regret Õ(√T + √D) and Õ(1) regret vs. safe comparator in adversarial bandits both with and without delays, matching new lower bounds up to logs.
-
Bandit Convex Optimization with Gradient Prediction Adaptivity
TP-VR-OPT achieves O(√(d E[S_T])) prediction-adaptive regret in two-point bandit convex optimization, with a matching Ω(√E[S_T]) lower bound up to √d, while single-point feedback cannot benefit from predictions.
-
Online Conformal Prediction with Corrupted Feedback
Develops robust online conformal prediction schemes that provide explicit miscoverage guarantees under feedback modeled as arbitrary binary flips or bounded-memory errors.
-
Calibeating for general proper losses: A Bregman divergence approach
A Bregman divergence approach yields a unified calibeating framework for general proper losses, delivering U-calibration and logarithmic regret for Tsallis losses with weaker dimension dependence than prior work.
-
Online Learning-to-Defer with Varying Experts
Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Online Resource Allocation With General Constraints
An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.
-
Constrained Contextual Bandits with Adversarial Contexts
A modular reduction from budget-constrained contextual bandits with adversarial contexts to unconstrained bandits via surrogate rewards, yielding improved guarantees and an efficient algorithm based on SquareCB.
-
Online Localized Conformal Prediction
OLCP combines online adaptation with covariate-based localization to achieve valid long-run coverage and narrower intervals than global baselines in non-exchangeable settings.
-
Online Nonstochastic Prediction: Logarithmic Regret via Predictive Online Least Squares
Predictive hints from any stabilizing Luenberger observer make hint residuals uniformly bounded in online least squares, yielding logarithmic regret for nonstochastic prediction despite unbounded trajectories in margi...
-
Single-Period Portfolio Selection via Information Projection
CRRA portfolio selection equals Rényi information projection with the Rényi order matching the relative risk aversion coefficient, yielding a Blahut-Arimoto-style alternating optimizer that needs fewer iterations at l...
-
Single-Period Portfolio Selection via Information Projection
CRRA portfolio selection is equivalent to a Rényi information-projection problem whose order equals the investor's relative risk aversion, yielding an alternating optimization algorithm.
-
Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
A framework for concave distributional utility maximization in stochastic bandits via influence-function stochastic gradients and entropic mirror ascent on the simplex, with regret bounds.
-
FedSEA: Achieving Benefit of Parallelization in Federated Online Learning
FedSEA achieves O(sqrt(T)) regret for smooth convex losses and O(log T) for smooth strongly convex losses in federated online learning under stochastic adversary, with parallelization benefits when temporal heterogene...
-
Gradient-Variation Regret Bounds for Unconstrained Online Learning
Parameter-free algorithms for unconstrained online learning achieve regret bounds of order O(||u|| sqrt(V_T(u)) + L||u||^2 + G^4) for L-smooth convex losses without prior knowledge of ||u||, L or G, with extensions to...
-
Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
Optimal regret bounds O(δ^{-1/2}√T) for convex and O(δ^{-1} log T) for strongly convex losses are achieved in distributed online convex optimization under compressed communication.
-
Learning Safely Without Knowing the World:COMPASS-Hedge
COMPASS-Hedge is presented as the first parameter-free full-information anytime algorithm that simultaneously delivers minimax-optimal adversarial regret, instance-optimal stochastic regret, and Õ(1) regret to a basel...
-
Non-Stationary Online Structured Prediction with Surrogate Losses
In non-stationary online structured prediction, cumulative target loss is bounded by F_T + O(1 + P_T) using dynamic regret of OGD combined with surrogate-gap exploitation.
-
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction
A projection-based algorithm for COCO achieves O(log T) regret and O(log T) CCV for strongly convex losses and O(sqrt(T)) for convex losses by leveraging self-contracted curves.
-
Optimistic Dual Averaging Unifies Modern Optimizers
SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
-
Online Sharp-Calibrated Bayesian Optimization
OSCBO adaptively balances Gaussian process sharpness and calibration in Bayesian optimization by casting hyperparameter selection as constrained online learning, while preserving sublinear regret bounds.
-
Online Localized Conformal Prediction
OLCP and OLCP-Hedge achieve long-run valid coverage in non-exchangeable online settings with narrower prediction sets by localizing conformal prediction to covariates and selecting bandwidth via online convex optimization.
-
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
-
Partially Lazy Gradient Descent for Smoothed Online Learning
k-lazyGD achieves optimal dynamic regret O(sqrt((P_T+1)T)) in SOCO for laziness k up to Theta(sqrt(T/P_T)).
-
Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data
A sub-Gaussian mixture achieves almost sure ln ln V_T regret on unbounded data via a pathwise bound that holds on the probability-one Ville event.
-
Implicit score-driven filters for time-varying parameter models
Implicit score-driven updates preserve the full observation density to deliver global stability and mean-squared-error contraction toward the pseudo-true parameter for log-concave densities in time-varying parameter models.
-
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
-
When Determinants Are Not Enough: Private Rare Switching
Replaces determinant growth with generalized Rayleigh quotient for rare switching in private linear bandits to control worst-direction volume despite non-monotonic design matrices from noise.
-
A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound
Shifted KT potentials equal a prior change in KT, and this removes the ln ln T factor from Squint's data-independent bound.
-
Revisiting Active Sequential Prediction-Powered Mean Estimation
Non-asymptotic analysis of prediction-powered mean estimation shows that no-regret learning for query probabilities converges to the maximum allowed constant value, independent of covariates.
-
Distributed Associative Memory via Online Convex Optimization
A distributed online convex optimization protocol for associative memory achieves sublinear regret guarantees and outperforms baselines in experiments.
-
The Bayesian Reflex: Online Learning as the Autonomic Nervous System of Modern and Future AI
The Bayesian reflex unifies online Bayesian learning through belief maintenance, Bayes' theorem updates, and exploration-exploitation balancing, with extensions to climate modeling, time series, prime number discovery...
-
Stochastic Optimization and Data Science
The paper motivates stochastic optimization problems from statistical perspectives and describes offline and online approaches to solve expectation minimization problems.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.