A Modern Introduction to Online Learning

Francesco Orabona

arxiv: 1912.13213 · v9 · submitted 2019-12-31 · 💻 cs.LG · math.OC · stat.ML


Pith reviewed 2026-05-24 14:00 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML

keywords online learning · online convex optimization · online mirror descent · follow the regularized leader · regret minimization · adaptive algorithms · parameter-free algorithms · bandit algorithms


The pith

Online learning algorithms are presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader with emphasis on adaptive parameter tuning and unbounded domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The book introduces online learning as regret minimization under worst-case assumptions using the framework of online convex optimization. It shows how first-order and second-order algorithms in Euclidean and non-Euclidean settings can be viewed as variants of Online Mirror Descent or Follow-The-Regularized-Leader. Special emphasis is placed on handling parameter tuning and learning in unbounded domains via adaptive and parameter-free methods. Non-convex losses are handled via convex surrogate losses and randomization, with coverage of the bandit setting and advanced topics including saddle-point optimization and applications to generalization theory.

Core claim

All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants, with particular attention to tuning parameters and to learning in unbounded domains through adaptive and parameter-free algorithms.

What carries the argument

Online Mirror Descent and Follow-The-Regularized-Leader, which serve as the unifying frameworks for deriving and analyzing the algorithms.
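The unification is easiest to see in the two simplest cases, where OMD and FTRL produce identical iterates. The sketch below assumes linear losses (each round supplies only a gradient g_t), a fixed learning rate eta, and a start at x_1 = 0 (Euclidean case) or at the uniform distribution (entropic case); the function names are hypothetical and not taken from the book. With a squared-Euclidean regularizer on an unconstrained domain, OMD is plain online gradient descent and FTRL returns -eta times the sum of past gradients; with the negative-entropy regularizer on the probability simplex, both give exponential weights. On general constrained domains the two frameworks can differ (greedy versus lazy projection), so this coincidence is specific to these cases rather than a general identity.

```python
import math
import random


def ogd_unconstrained(grads, eta):
    """OMD with regularizer ||x||^2 / 2 on R^d (no projection), i.e. online
    gradient descent from x_1 = 0: x_{t+1} = x_t - eta * g_t."""
    x = [0.0] * len(grads[0])
    iterates = [list(x)]
    for g in grads:
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        iterates.append(list(x))
    return iterates


def ftrl_l2_linear(grads, eta):
    """FTRL with regularizer ||x||^2 / (2 * eta) on R^d and linear losses:
    x_{t+1} = argmin_x <g_1 + ... + g_t, x> + ||x||^2 / (2 * eta)
            = -eta * (g_1 + ... + g_t)."""
    theta = [0.0] * len(grads[0])
    iterates = [list(theta)]
    for g in grads:
        theta = [th + gi for th, gi in zip(theta, g)]
        iterates.append([-eta * th for th in theta])
    return iterates


def omd_entropy(grads, eta):
    """OMD on the simplex with the negative-entropy regularizer
    (exponentiated gradient): x_{t+1,i} proportional to x_{t,i} * exp(-eta * g_{t,i})."""
    d = len(grads[0])
    x = [1.0 / d] * d
    iterates = [list(x)]
    for g in grads:
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]
        iterates.append(list(x))
    return iterates


def ftrl_entropy(grads, eta):
    """FTRL on the simplex with regularizer (1/eta) * sum_i x_i log x_i and
    linear losses: x_{t+1,i} proportional to exp(-eta * (g_{1,i} + ... + g_{t,i}))."""
    d = len(grads[0])
    theta = [0.0] * d
    iterates = [[1.0 / d] * d]
    for g in grads:
        theta = [th + gi for th, gi in zip(theta, g)]
        m = min(theta)  # shift for numerical stability; cancels after normalization
        w = [math.exp(-eta * (th - m)) for th in theta]
        z = sum(w)
        iterates.append([wi / z for wi in w])
    return iterates


def max_gap(a, b):
    """Largest coordinate-wise difference between two iterate sequences."""
    return max(abs(u - v) for xa, xb in zip(a, b) for u, v in zip(xa, xb))


random.seed(0)
grads = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
print(max_gap(ogd_unconstrained(grads, 0.1), ftrl_l2_linear(grads, 0.1)))
print(max_gap(omd_entropy(grads, 0.5), ftrl_entropy(grads, 0.5)))
```

Both printed gaps should be at floating-point rounding level: the iterates differ only in the order in which the same sums and products are evaluated.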

If this is right

A uniform analysis applies to first-order and second-order algorithms across Euclidean and non-Euclidean settings.
Adaptive and parameter-free algorithms enable operation without prior knowledge of domain bounds.
Non-convex losses can be addressed through convex surrogates and randomization.
The framework extends to bandit problems, saddle-point optimization, and non-stationary regret analysis.
Online learning techniques yield results in generalization theory and concentration inequalities.
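The parameter-free idea in the list above can be made concrete with coin betting, one standard route to parameter-free online learning. The sketch below assumes one-dimensional online linear optimization with gradients g_t in [-1, 1], an initial wealth of 1, and the Krichevsky-Trofimov (KT) betting fraction; the function name is hypothetical. The learner treats c_t = -g_t as a coin outcome and bets a signed fraction beta_t of its current wealth, so it needs neither a learning rate nor a bound on the comparator u. Because the total reward sum_t c_t x_t equals final wealth minus initial wealth, a lower bound on the wealth converts into a regret bound against any u in R; the code only illustrates the mechanics.

```python
import random


def kt_coin_betting(grads, initial_wealth=1.0):
    """Parameter-free 1-D online linear optimization via KT coin betting.

    Assumes every gradient g_t lies in [-1, 1]; the coin outcome is c_t = -g_t.
    Returns the predictions x_t and the wealth after each round.
    """
    wealth = initial_wealth
    sum_c = 0.0  # c_1 + ... + c_{t-1}
    predictions, wealths = [], []
    for t, g in enumerate(grads, start=1):
        if not -1.0 <= g <= 1.0:
            raise ValueError("gradients must lie in [-1, 1]")
        beta = sum_c / t       # KT betting fraction, |beta| <= (t - 1) / t < 1
        x = beta * wealth      # bet a signed fraction of the current wealth
        c = -g
        wealth += c * x        # wealth_t = wealth_{t-1} * (1 + c * beta) > 0
        sum_c += c
        predictions.append(x)
        wealths.append(wealth)
    return predictions, wealths


# Constant gradient -1: the best fixed comparator is u -> +infinity.
xs, ws = kt_coin_betting([-1.0] * 100)
print(xs[:3], ws[:3])

# Arbitrary gradients in [-1, 1]: the wealth never reaches zero.
rng = random.Random(3)
_, ws_random = kt_coin_betting([rng.uniform(-1.0, 1.0) for _ in range(2000)])
print(min(ws_random) > 0)
```

With a constant gradient of -1 the wealth grows exponentially in T, while on arbitrary gradients in [-1, 1] it stays positive, since |beta_t| <= (t-1)/t < 1 and |c_t| <= 1.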

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unified presentation may make it easier to transfer tuning techniques across different loss functions.
Connections to statistical learning could be tested by applying the regret bounds directly to derive new concentration results.

Load-bearing premise

The proofs have been chosen to be simple and short, which sometimes requires adding one or two additional assumptions just to simplify them.

What would settle it

Finding an online learning algorithm that cannot be expressed as an instantiation of Online Mirror Descent or Follow-The-Regularized-Leader or their variants would challenge the book's central framing.

Original abstract

In this book, I introduce the basic concepts of Online Learning through the modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiation of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are addressed through convex surrogate losses and randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. Finally, I also cover advanced topics, including black-box reductions, saddle-point optimization, sequential investment, and non-stationary forms of regret analysis. The book concludes with a selection of applications of online learning to domains far from it, such as generalization theory and concentration inequalities. I tried to maintain an informal, but mathematically serious, tone throughout the book. No prior knowledge of convex analysis is required. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible. This also means that sometimes I have added one or two additional assumptions, just to simplify the proofs.
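The bandit setting mentioned above reuses the same machinery. A minimal sketch, assuming losses in [0, 1] and an oblivious adversary: Exp3 can be viewed as FTRL with the negative-entropy regularizer run on importance-weighted loss estimates, where only the pulled arm's loss is observed and dividing it by that arm's probability makes the estimate unbiased. The Bernoulli environment, the learning rate of order sqrt(ln K / (K T)), and all names are illustrative assumptions; stochastic losses are used only because they are a special case of an oblivious adversary and make the output easy to check.

```python
import math
import random


def exp3(loss_fn, n_arms, T, eta, rng):
    """Exp3: FTRL with the negative-entropy regularizer on
    importance-weighted loss estimates. Losses must lie in [0, 1]."""
    cum_est = [0.0] * n_arms  # cumulative estimated loss of each arm
    total_loss = 0.0
    for t in range(T):
        m = min(cum_est)  # shift for numerical stability
        w = [math.exp(-eta * (L - m)) for L in cum_est]
        z = sum(w)
        p = [wi / z for wi in w]
        arm = rng.choices(range(n_arms), weights=p)[0]
        loss = loss_fn(t, arm)  # bandit feedback: only this arm's loss is seen
        total_loss += loss
        cum_est[arm] += loss / p[arm]  # unbiased estimate of the full loss vector
    return total_loss, cum_est


# Hypothetical environment: Bernoulli losses, arm 2 is the best arm.
env_rng = random.Random(42)
means = [0.5, 0.5, 0.2]


def bernoulli_loss(t, arm):
    return 1.0 if env_rng.random() < means[arm] else 0.0


T, K = 5000, 3
eta = math.sqrt(2 * math.log(K) / (K * T))  # learning rate of order sqrt(ln K / (K T))
total, _ = exp3(bernoulli_loss, K, T, eta, random.Random(0))
avg_loss = total / T
print(avg_loss)
```

With these settings the average loss should end up close to the best arm's mean of 0.2, well below the 0.4 that uniformly random play would incur.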

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)


This is a textbook that compiles existing online convex optimization methods; it contains no new results. Orabona presents first-order and second-order algorithms in Euclidean and non-Euclidean settings, all framed as variants of Online Mirror Descent or Follow-The-Regularized-Leader. The emphasis on adaptive tuning and on parameter-free methods for unbounded domains is useful. Non-convex cases are handled with surrogates and randomization, and there is coverage of bandits plus advanced topics such as saddle-point problems and non-stationary regret. The applications to generalization theory at the end connect the material to statistical learning. This organization is the main strength: it gives a coherent view without requiring much background. The short proofs are chosen carefully, though the author notes that they sometimes rely on added assumptions for simplicity. The soft spot is straightforward: there are no new results, and everything comes from the cited literature. If the goal is teaching or reference, that is fine, but the book does not push any boundaries. Readers new to the area, or instructors teaching a course on it, would benefit, and a reading group might use the chapters on specific topics. I would recommend this for peer review as a monograph; it seems a careful effort that could help organize the subfield.

Referee Report

0 major / 2 minor

Summary. The manuscript is an expository monograph introducing online learning via online convex optimization. It presents first- and second-order algorithms for convex losses in Euclidean and non-Euclidean geometries, uniformly framing them as instantiations of Online Mirror Descent (OMD) or Follow-The-Regularized-Leader (FTRL) and variants. Emphasis is placed on parameter tuning, adaptive and parameter-free methods for unbounded domains, convex surrogates for non-convex losses, the bandit setting, black-box reductions, saddle-point problems, sequential investment, non-stationary regret, and applications to generalization bounds and concentration inequalities. All proofs are selected to be short and elementary, sometimes at the cost of extra assumptions; no prior convex analysis is required.

Significance. If the derivations are accurate, the book offers a coherent modern synthesis that unifies disparate algorithms under OMD/FTRL, foregrounds practical issues of tuning and unbounded domains, and extends the material to applications outside core online learning. The explicit disclosure of proof simplifications and the informal-yet-rigorous tone are assets for accessibility.

minor comments (2)

[Abstract and introduction] The statement that 'sometimes I have added one or two additional assumptions, just to simplify the proofs' is appropriate, but each such assumption must be explicitly flagged where it is introduced (e.g., in the statement of each theorem or in a dedicated 'assumptions' paragraph) so that readers can evaluate its necessity.
Throughout: notation for regularizers, step-size schedules, and dual norms should be introduced once with a consolidated table or glossary; repeated re-definition of the same symbols across chapters risks confusion for readers new to the area.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, the accurate summary of the manuscript's scope and tone, and the recommendation of minor revision. We are pleased that the unification under OMD/FTRL, the emphasis on tuning and unbounded domains, and the accessibility for readers without prior convex analysis background are viewed favorably.

Circularity Check

0 steps flagged

No significant circularity


This is an expository monograph that presents standard algorithms as instantiations of Online Mirror Descent and Follow-The-Regularized-Leader without introducing new derivations, fitted parameters presented as predictions, or load-bearing self-citations. The abstract explicitly discloses the choice of simplified proofs with added assumptions, and the work relies on external prior literature rather than reducing any central claim to its own inputs by construction. No steps meet the criteria for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an introductory monograph on established topics in online convex optimization; the author introduces no new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5760 in / 1024 out tokens · 21139 ms · 2026-05-24T14:00:07.629023+00:00 · methodology


Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time
cs.LG 2026-05 unverdicted novelty 8.0

A robust variant of binary search achieves regret O(C + log T) for dynamic pricing with known corruption C and O(C + log² T) when unknown.
Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays
cs.LG 2026-05 unverdicted novelty 7.0

Prudent-Banker achieves pseudo-regret Õ(√T + √D) and Õ(1) regret vs. safe comparator in adversarial bandits both with and without delays, matching new lower bounds up to logs.
Bandit Convex Optimization with Gradient Prediction Adaptivity
cs.LG 2026-05 unverdicted novelty 7.0

TP-VR-OPT achieves O(√(d E[S_T])) prediction-adaptive regret in two-point bandit convex optimization, with a matching Ω(√E[S_T]) lower bound up to √d, while single-point feedback cannot benefit from predictions.
Online Conformal Prediction with Corrupted Feedback
cs.LG 2026-05 unverdicted novelty 7.0

Develops robust online conformal prediction schemes that provide explicit miscoverage guarantees under feedback modeled as arbitrary binary flips or bounded-memory errors.
Calibeating for general proper losses: A Bregman divergence approach
cs.LG 2026-05 unverdicted novelty 7.0

A Bregman divergence approach yields a unified calibeating framework for general proper losses, delivering U-calibration and logarithmic regret for Tsallis losses with weaker dimension dependence than prior work.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 7.0

Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Online Resource Allocation With General Constraints
cs.GT 2026-05 unverdicted novelty 7.0

An algorithm for online resource allocation with budget and general constraints achieves O(sqrt(T)) regret in stochastic and alpha-regret in adversarial regimes with bounded constraint violations.
Constrained Contextual Bandits with Adversarial Contexts
cs.LG 2026-05 unverdicted novelty 7.0

A modular reduction from budget-constrained contextual bandits with adversarial contexts to unconstrained bandits via surrogate rewards, yielding improved guarantees and an efficient algorithm based on SquareCB.
Online Localized Conformal Prediction
cs.LG 2026-05 unverdicted novelty 7.0

OLCP combines online adaptation with covariate-based localization to achieve valid long-run coverage and narrower intervals than global baselines in non-exchangeable settings.
Online Nonstochastic Prediction: Logarithmic Regret via Predictive Online Least Squares
cs.LG 2026-05 unverdicted novelty 7.0

Predictive hints from any stabilizing Luenberger observer make hint residuals uniformly bounded in online least squares, yielding logarithmic regret for nonstochastic prediction despite unbounded trajectories in margi...
Single-Period Portfolio Selection via Information Projection
cs.IT 2026-05 unverdicted novelty 7.0

CRRA portfolio selection equals Rényi information projection with the Rényi order matching the relative risk aversion coefficient, yielding a Blahut-Arimoto-style alternating optimizer that needs fewer iterations at l...
Single-Period Portfolio Selection via Information Projection
cs.IT 2026-05 unverdicted novelty 7.0

CRRA portfolio selection is equivalent to a Rényi information-projection problem whose order equals the investor's relative risk aversion, yielding an alternating optimization algorithm.
Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
stat.ML 2026-04 unverdicted novelty 7.0

A framework for concave distributional utility maximization in stochastic bandits via influence-function stochastic gradients and entropic mirror ascent on the simplex, with regret bounds.
FedSEA: Achieving Benefit of Parallelization in Federated Online Learning
cs.LG 2026-04 unverdicted novelty 7.0

FedSEA achieves O(sqrt(T)) regret for smooth convex losses and O(log T) for smooth strongly convex losses in federated online learning under stochastic adversary, with parallelization benefits when temporal heterogene...
Gradient-Variation Regret Bounds for Unconstrained Online Learning
cs.LG 2026-04 unverdicted novelty 7.0

Parameter-free algorithms for unconstrained online learning achieve regret bounds of order O(||u|| sqrt(V_T(u)) + L||u||^2 + G^4) for L-smooth convex losses without prior knowledge of ||u||, L or G, with extensions to...
Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications
cs.LG 2026-04 unverdicted novelty 7.0

Optimal regret bounds O(δ^{-1/2}√T) for convex and O(δ^{-1} log T) for strongly convex losses are achieved in distributed online convex optimization under compressed communication.
Learning Safely Without Knowing the World: COMPASS-Hedge
cs.LG 2026-03 unverdicted novelty 7.0

COMPASS-Hedge is presented as the first parameter-free full-information anytime algorithm that simultaneously delivers minimax-optimal adversarial regret, instance-optimal stochastic regret, and Õ(1) regret to a basel...
Non-Stationary Online Structured Prediction with Surrogate Losses
cs.LG 2025-10 unverdicted novelty 7.0

In non-stationary online structured prediction, cumulative target loss is bounded by F_T + O(1 + P_T) using dynamic regret of OGD combined with surrogate-gap exploitation.
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction
cs.LG 2026-05 unverdicted novelty 6.0

A projection-based algorithm for COCO achieves O(log T) regret and O(log T) CCV for strongly convex losses and O(sqrt(T)) for convex losses by leveraging self-contracted curves.
Optimistic Dual Averaging Unifies Modern Optimizers
cs.LG 2026-05 unverdicted novelty 6.0

SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
Online Sharp-Calibrated Bayesian Optimization
cs.LG 2026-05 unverdicted novelty 6.0

OSCBO adaptively balances Gaussian process sharpness and calibration in Bayesian optimization by casting hyperparameter selection as constrained online learning, while preserving sublinear regret bounds.
Online Localized Conformal Prediction
cs.LG 2026-05 unverdicted novelty 6.0

OLCP and OLCP-Hedge achieve long-run valid coverage in non-exchangeable online settings with narrower prediction sets by localizing conformal prediction to covariates and selecting bandwidth via online convex optimization.
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.
Partially Lazy Gradient Descent for Smoothed Online Learning
cs.LG 2026-01 unverdicted novelty 6.0

k-lazyGD achieves optimal dynamic regret O(sqrt((P_T+1)T)) in SOCO for laziness k up to Theta(sqrt(T/P_T)).
Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data
cs.LG 2025-12 unverdicted novelty 6.0

A sub-Gaussian mixture achieves almost sure ln ln V_T regret on unbounded data via a pathwise bound that holds on the probability-one Ville event.
Implicit score-driven filters for time-varying parameter models
stat.ME 2025-12 unverdicted novelty 6.0

Implicit score-driven updates preserve the full observation density to deliver global stability and mean-squared-error contraction toward the pseudo-true parameter for log-concave densities in time-varying parameter models.
Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
cs.LG 2025-09 conditional novelty 6.0

Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.
When Determinants Are Not Enough: Private Rare Switching
cs.LG 2026-05 unverdicted novelty 5.0

Replaces determinant growth with generalized Rayleigh quotient for rare switching in private linear bandits to control worst-direction volume despite non-monotonic design matrices from noise.
A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound
cs.LG 2026-04 unverdicted novelty 5.0

Shifted KT potentials equal a prior change in KT, and this removes the ln ln T factor from Squint's data-independent bound.
Revisiting Active Sequential Prediction-Powered Mean Estimation
stat.ML 2026-04 unverdicted novelty 5.0

Non-asymptotic analysis of prediction-powered mean estimation shows that no-regret learning for query probabilities converges to the maximum allowed constant value, independent of covariates.
Distributed Associative Memory via Online Convex Optimization
cs.LG 2025-09 unverdicted novelty 5.0

A distributed online convex optimization protocol for associative memory achieves sublinear regret guarantees and outperforms baselines in experiments.
The Bayesian Reflex: Online Learning as the Autonomic Nervous System of Modern and Future AI
stat.ME 2026-05 unverdicted novelty 3.0

The Bayesian reflex unifies online Bayesian learning through belief maintenance, Bayes' theorem updates, and exploration-exploitation balancing, with extensions to climate modeling, time series, prime number discovery...
Stochastic Optimization and Data Science
math.OC 2026-05 unverdicted novelty 2.0

The paper motivates stochastic optimization problems from statistical perspectives and describes offline and online approaches to solve expectation minimization problems.