pith. sign in

arxiv: 2601.12238 · v4 · pith:GITJNNV6new · submitted 2026-01-18 · 📊 stat.ML · cs.LG· math.OC

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Pith reviewed 2026-05-21 15:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC
keywords momentum SGDnonstationary optimizationtracking errordistribution shiftdrift amplificationminimax lower boundstime-varying objectives
0
0 comments X

The pith

Momentum SGD incurs a drift-amplification penalty under distribution shift that grows as the momentum parameter approaches 1, producing systematic tracking lag that vanilla SGD avoids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines SGD and its momentum variants for tracking optima in time-varying stochastic problems that remain strongly convex and smooth. It derives finite-time bounds that split the tracking error into transient, noise-driven, and drift-driven parts. The decomposition reveals that higher momentum values amplify the drift component, creating lag that worsens as the momentum parameter nears 1. Matching minimax lower bounds under gradient-variation limits show the penalty is unavoidable in drift-heavy regimes rather than an artifact of the upper-bound analysis.

Core claim

Under the stated strong-convexity and smoothness assumptions, momentum methods introduce an explicit drift-amplification term in the tracking error that diverges with the momentum parameter β approaching 1. This produces systematic lag behind the moving optimum. The upper bounds are paired with minimax lower bounds over gradient-variation classes, establishing that stale-gradient averaging forces an information-theoretic barrier: in drift-dominated regimes momentum is provably inferior to plain SGD.

What carries the argument

The tracking-error decomposition that isolates a drift-induced component whose size grows with the momentum parameter β.

If this is right

  • In drift-dominated regimes vanilla SGD provably tracks the moving optimum more accurately than momentum variants.
  • The information-theoretic lower bounds delimit the regime boundaries where momentum becomes harmful.
  • The explicit penalty explains observed instability of momentum in dynamic environments.
  • Practitioners can use the decomposition to decide when to disable momentum.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar lag penalties may appear in other accelerated first-order methods under nonstationarity.
  • The result suggests testing momentum decay schedules that keep β well below 1 during periods of high drift.
  • Extensions could relax strong convexity to see whether the barrier persists in convex or nonconvex drifting settings.

Load-bearing premise

The time-varying objectives remain strongly convex and smooth throughout the analysis.

What would settle it

In a controlled strongly convex and smooth nonstationary problem with measurable gradient variations, run momentum SGD at several β values approaching 1 and record whether the observed tracking error increases with β; if the error stays flat or decreases, the claimed penalty does not hold.

Figures

Figures reproduced from arXiv: 2601.12238 by Cameron J. Hogan, Martin T. Wells, Sharan Sahu.

Figure 1
Figure 1. Figure 1: Two minimax regimes and the inertia window. (a) After a regime change, momentum averages past gradients: this reduces noise but induces inertia, so the iterate lags behind the drifting minimizer for an “inertia window” whose length grows with β. (b) The minimax lower bound in Theorem 3.7 is the maximum of two contributions: a noise/variation-limited term (information-limited) and an inertia-limited term (d… view at source ↗
Figure 2
Figure 2. Figure 2: Tracking a drifting minimizer under strong convexity. Mean squared tracking error versus time for SGD, Heavy-Ball (HB), and Nesterov (NAG) on a strongly convex quadratic with γ = 0.1, β = 0.9, σ2 = 0.8 [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Structure of the proof: Arrows represent the main logical implications between results. The section Tracking Bounds in Expectation develops the core recursive analysis for both (SGD) (C.1–C.4) and (SGDM) (C.5–C.9), including the key Φe → Φ conversion (C.7). High-Probability Bounds lifts these results to pathwise guarantees via concentration arguments (D.1–D.4). The Minimax Lower Bounds section constructs t… view at source ↗
Figure 4
Figure 4. Figure 4: Non-stationary benchmark suite. Results for Linear and Logistic tasks. (Continued on next page.) 63 [PITH_FULL_IMAGE:figures/full_fig_p063_4.png] view at source ↗
Figure 4
Figure 4. Figure 4: Non-stationary benchmark suite. Mean ± 1 std over 20 seeds for well-conditioned (κ = 10) and ill-conditioned (κ = 1000) regimes. Tracking error: parameter-space for linear/logistic, prediction-space for MLP. Step sizes respect the stability bounds from Theorem 3.3. 64 [PITH_FULL_IMAGE:figures/full_fig_p064_4.png] view at source ↗
read the original abstract

In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $\beta$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes SGD and momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Finite-time upper bounds decompose tracking error into transient, noise-induced, and drift-induced terms, exposing a drift-amplification penalty for momentum that diverges as β approaches 1. These upper bounds are paired with minimax lower bounds under gradient-variation constraints, establishing the penalty as an information-theoretic barrier in drift-dominated regimes where momentum is provably worse than vanilla SGD.

Significance. If the bounds are correct, the work supplies a precise theoretical account of why momentum can exhibit instability under distribution shift, while identifying explicit regime boundaries where SGD outperforms its accelerated counterparts. The upper/lower bound pairing is a strength, as it converts an empirical observation into a falsifiable, information-theoretic statement rather than an artifact of analysis technique.

major comments (1)
  1. [lower-bound section (minimax result)] The central claim that the drift penalty is information-theoretic rests on the gradient-variation constraint in the lower-bound construction. Please confirm in the proof of the minimax lower bound that the variation measure is defined identically to the one appearing in the upper-bound drift term; any mismatch would weaken the matching claim.
minor comments (2)
  1. The abstract states that both Polyak and Nesterov momentum incur the same drift penalty; the main text should explicitly note whether the constants differ between the two variants or whether the analysis treats them uniformly.
  2. [finite-time bounds] All finite-time bounds should list the explicit dependence of the drift term on β and on the gradient-variation measure so that readers can directly recover the regime boundaries without re-deriving the expressions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and insightful comment on the minimax lower bound. We are pleased that the significance of the work is recognized. Below we provide a point-by-point response to the major comment.

read point-by-point responses
  1. Referee: [lower-bound section (minimax result)] The central claim that the drift penalty is information-theoretic rests on the gradient-variation constraint in the lower-bound construction. Please confirm in the proof of the minimax lower bound that the variation measure is defined identically to the one appearing in the upper-bound drift term; any mismatch would weaken the matching claim.

    Authors: We confirm that the variation measure is defined identically in both the upper and lower bounds. In the upper-bound analysis (Theorem 3), the drift-induced error term is bounded in terms of the total gradient variation V_G = sum_{t=1}^T ||nabla f_t(x_t^*) - nabla f_{t-1}(x_{t-1}^*)||, where x_t^* denotes the time-varying optimum. The minimax lower bound construction in Section 4 employs precisely the same constraint: the adversary is restricted to sequences where the cumulative gradient variation is at most V_G. This identical definition ensures that the lower bound applies directly to the regimes analyzed in the upper bounds, establishing that the drift-amplification penalty for momentum is indeed an information-theoretic barrier rather than an artifact of the analysis. To make this equivalence more explicit, we will include a clarifying remark in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives finite-time upper bounds decomposing tracking error into transient, noise-induced, and drift-induced terms for SGD and momentum methods under strong convexity and smoothness, explicitly expressing the drift-amplification penalty in terms of the momentum parameter β and gradient variation. These are paired with independent minimax lower bounds under gradient-variation constraints. No step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain; the bounds are self-contained mathematical derivations within the stated assumptions and do not rely on renaming or smuggling ansatzes from prior self-work. This matches the default expectation of no significant circularity (score 0-2) for papers with explicit, falsifiable bound derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard domain assumptions of strong convexity and smoothness plus a gradient-variation constraint; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption The time-varying objective functions are strongly convex and smooth.
    Invoked to obtain the finite-time tracking error bounds and the decomposition into transient, noise, and drift components.
  • domain assumption Gradient variation is bounded in the minimax lower-bound construction.
    Used to establish the information-theoretic barrier for momentum methods.

pith-pipeline@v0.9.0 · 5711 in / 1380 out tokens · 54206 ms · 2026-05-21T15:29:14.951043+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

    stat.ML 2026-05 unverdicted novelty 6.0

    Adam's adaptive preconditioning and first-moment averaging improve high-probability tracking error in noise-dominated nonstationary regimes but can increase it under strong drift, where SGD achieves a smaller floor, w...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Cevher, V ., Becker, S., and Schmidt, M

    doi: 10.23919/ACC.2019.8814889. Cevher, V ., Becker, S., and Schmidt, M. Convex optimiza- tion for big data: Scalable, randomized, and parallel al- gorithms for big data analytics.IEEE Signal Processing Magazine, 31(5):32–43, 2014. doi: 10.1109/MSP.2014. 2329397. Chen*, J., Pan*, X., Monga, R., Bengio, S., and Joze- fowicz, R. Revisiting distributed synch...

  2. [2]

    , author Zhou, A

    URL https://openreview.net/forum? id=HyAddcLge. Chen, J., Wolfe, C., Li, Z., and Kyrillidis, A. Demon: Improved neural network training with momentum decay. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3958–3962, 2022. doi: 10.1109/ICASSP43922.2022. 9746839. Chen, X., Wang, Y ., and Wang, Y .-...

  3. [3]

    cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper. pdf. Cutler, J., Drusvyatskiy, D., and Harchaoui, Z. Stochas- tic optimization under distributional drift.Jour- nal of Machine Learning Research, 24(147):1–56,

  4. [4]

    Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P

    URL http://jmlr.org/papers/v24/ 21-1410.html. Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P. Koala: A Kalman opti- mization algorithm with loss adaptivity.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (6):6471–6479, Jun. 2022. doi: 10.1609/aaai.v36i6. 20599. URL https://ojs.aaai.org/index. php/AAAI/art...

  5. [5]

    Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods

    URL https://proceedings.mlr.press/ v97/ghorbani19b.html. Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods. Curran Associates Inc., Red Hook, NY , USA, 2019. Godichon-Baggioni, A., Lu, W., and Portier, B. Online estimation of the inverse of the hessian for stochastic op- timization with applicat...

  6. [6]

    doi: 10.1561/2400000013

    ISSN 2167-3888. doi: 10.1561/2400000013. URL https://doi.org/10.1561/2400000013. Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. SPAM: Spike-aware Adam with momentum reset for sta- ble LLM training. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=L9eBxTCpQG. Ilboudo, W. E. L.,...

  7. [7]

    Jagannath, A., Jones-McCormick, T., and Sarangian, V

    doi: 10.1109/TNNLS.2020.3041755. Jagannath, A., Jones-McCormick, T., and Sarangian, V . High-dimensional limit theorems for SGD: Momentum and adaptive step-sizes, 2025. URL https://arxiv. org/abs/2511.03952. arXiv:2511.03952. Jastrz˛ ebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y ., and Storkey, A. On the relation between the sharpest direction...

  8. [8]

    Jelassi, S

    URL https://openreview.net/forum? id=SkgEaj05t7. Jelassi, S. and Li, Y . Towards understanding how momentum improves generalization in deep learning,

  9. [9]

    Jentzen, A., Kranz, J., and Riekert, A

    URL https://openreview.net/forum? id=lf0W6tcWmh-. Jentzen, A., Kranz, J., and Riekert, A. PADAM: Parallel av- eraged Adam reduces the error for stochastic optimization in scientific machine learning, 2025. URL https:// arxiv.org/abs/2505.22085. arXiv:2505.22085. Kingma, D. P. and Ba, J. Adam: A method for stochas- tic optimization. In Bengio, Y . and LeCu...

  10. [10]

    org/CorpusID:15039054

    URL https://api.semanticscholar. org/CorpusID:15039054. 10 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Li, X., Luo, J., Zheng, Z., Wang, H., Luo, L., Wen, L., Wu, L., and Xu, S. On the performance analysis of momentum method: A frequency domain perspective. InThe Thirteenth International Conference on Learning Re...

  11. [11]

    URL https://proceedings.mlr.press/ v237/liao24a.html. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y ., Qin, Y ., Xu, W., Lu, E., Yan, J., Chen, Y ., Zheng, H., Liu, Y ., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y ., Wang, J., Dong, M., Zhang, Z., Kang, Y ., Zhang, H., Xu, X., Zhang, Y ., Wu, Y ., Zhou, X., and Yang, Z. Muon is scalable for llm train...

  12. [12]

    Luo, L., Xiong, Y ., and Liu, Y

    URL https://openreview.net/forum? id=Syxt5oC5YQ. Luo, L., Xiong, Y ., and Liu, Y . Adaptive gradient methods with dynamic bound of learning rate. InIn- ternational Conference on Learning Representations,

  13. [13]

    Nesterov, Y

    URL https://openreview.net/forum? id=Bkg3g2R9FX. Nesterov, Y . A method for solving the convex programming problem with convergence rateo(1/k2).Dokl Akad Nauk SSSR, 269:543, 1983. URL https://cir.nii.ac. jp/crid/1370862715914709505. Nesterov, Y . Introductory lectures on convex opti- mization - a basic course. InApplied Optimization,

  14. [14]

    The Extended Kalman Filter is a Natural Gradient Descent in Trajectory Space

    URL https://api.semanticscholar. org/CorpusID:62288331. Ollivier, Y . The extended Kalman filter is a natural gradi- ent descent in trajectory space, 2019. URL https:// arxiv.org/abs/1901.00696. arXiv:1901.00696. Papyan, V . Traces of class/cross-class structure pervade deep learning spectra.Journal of Machine Learning Research, 21(252):1–64, 2020. URL ht...

  15. [15]

    URL https://arxiv.org/abs/2506. 11904. arXiv:2506.11904. Vuckovic, J. Kalman gradient descent: Adap- tive variance reduction in stochastic optimization,

  16. [16]

    URL https://arxiv.org/abs/1810. 12273. arXiv:1810.12273. Vyas, N., Morwani, D., Zhao, R., Shapira, I., Brandfon- brener, D., Janson, L., and Kakade, S. M. SOAP: Im- proving and stabilizing shampoo using Adam for lan- guage modeling. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=IDxZhXrp...

  17. [17]

    URL https: //doi.org/10.1155/2017/9478952

    doi: 10.1155/2017/9478952. URL https: //doi.org/10.1155/2017/9478952. Article ID 9478952, 9 pages. Yang, X. Towards stochastic gradient variance reduction by solving a filtering problem, 2023. URL https:// openreview.net/forum?id=0sxmoci9Ma. Yao, Z., Gholami, A., Shen, S., Mustafa, M., Keutzer, K., and Mahoney, M. ADAHESSIAN: An adaptive second order opti...

  18. [18]

    ADADELTA: An Adaptive Learning Rate Method

    URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3-Paper. pdf. Zeiler, M. D. Adadelta: An adaptive learning rate method, 2012. URL https://arxiv.org/abs/ 1212.5701. arXiv:1212.5701. Zhang, J. and Mitliagkas, I. Yellowfin and the art of momentum tuning. In Talwalkar, A., Smith, V ., and Zaharia, M. (eds.),Proc...

  19. [19]

    org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper

    URL https://proceedings.mlsys. org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper. pdf. Zhang, T. Solving large scale linear prediction prob- lems using stochastic gradient descent algorithms. In 12 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Proceedings of the Twenty-First International Con-...

  20. [20]

    Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H

    URL https://www.sciencedirect.com/ science/article/pii/S0167739X18332540. Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H. Adaptiv- ity and non-stationarity: Problem-dependent dynamic re- gret for online convex optimization.Journal of Machine Learning Research, 25(98):1–52, 2024. URL http: //jmlr.org/papers/v25/21-0748.html. Zhou, K., Jin, Y ., Ding, Q...

  21. [21]

    time- resolved

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ eddea82ad2755b24c4e168c5fc2ebd40-Paper. pdf. Ziyin, L., Wang, Z. T., and Ueda, M. Laprop: Separating mo- mentum and adaptivity in Adam, 2021. URL https:// arxiv.org/abs/2002.04839. arXiv:2002.04839. 13 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization App...

  22. [22]

    Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s

    (Martingale difference noise)Define the conditional mean gradient mt+1(θ) =E ∇θg(θ, Xt+1)| F t , and the gradient noise ξt+1(θ) =∇ θg(θ, Xt+1)−m t+1(θ). Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s. for allθ∈R d. Under this framework, SGD admits the decomposition θt+1 =θ t −γ tmt+1(θt)−γ tξt+1(θt).(SGD) For momentum, we use a two-par...

  23. [23]

    (Gradient noise along iterates)The gradient noise ξt+1 satisfies E ∥ξt+1(θt)∥2 ≤σ 2 and E ∥ξt+1(ψt)∥2 ≤ 18 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization σ2 a.s. C.1. Proof for SGD tracking error bound First we will prove a recursive relation for the tracking error that we will subsequently use for our expectation a...

  24. [24]

    eθt eθt−1 # = 1 1−β

    (Step-decay schedule in the low drift-to-noise regime).Suppose γ⋆ <1/2L (equivalently, the minimizer of E(γ) is not at the smoothness cap), so that γ⋆ = 8∆2 µσ2 1/3 ,E= 3 ∆σ2 µ2 2/3 . Define epochsk= 0,1, . . . , K−1with γ0 := 1 2L , γ k := γk−1 +γ ⋆ 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ m , and epoch lengths T0 := l 2 µγ0 log 2∥θ0 −θ ⋆ 0∥2 E(γ0) m , T k := l2 lo...

  25. [25]

    mode-splitting

    (Step-decay schedule with momentum restart).Suppose γ⋆ β < µ(1−β) 2/(4L2) (i.e., the minimizer of Eβ(γ) is not at the stability cap). Define the epoch stepsizes γ0 := µ(1−β) 2 4L2 , γ k := γk−1 +γ ⋆ β 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ β m . 31 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Define epoch lengths T0 := l1−...

  26. [26]

    (Minimizers.) Eachg u admits a unique minimizerθ ⋆ u, andθ ⋆ u =uae 1

  27. [27]

    Proof of Lemma E.4

    (Discrepancy.) The two-point discrepancy satisfiesχ(g +, g−)≥µa 2/8. Proof of Lemma E.4. On the Euclidean ball {θ:∥θ∥ 2 ≤r/2} , we have ψr(θ) = 1 and ∇ψr(θ) = 0 by construction of the bump. Therefore, for all ∥θ∥2 ≤r/2 , ∇gu(θ) =µθ−uµae 1. Setting ∇gu(θ) = 0 yields θ=uae 1, and this point indeed lies in the region where the above simplification holds sinc...

  28. [28]

    The Markov chain (St)t≥0 admits a unique stationary distribution π∞, which is a centered Gaussian N(0,Σ ∞) on R2

  29. [29]

    The stationary covarianceΣ ∞ is the unique positive semidefinite solution of the discrete-time Lyapunov equation Σ =AΣA ⊤ +σ 2BB ⊤.(172)

  30. [30]

    WritingΣ ∞ = ( v c c v )withv= Var π∞(xt)andc= Cov π∞(xt, xt−1), we have the exact closed form v= (1 +β)γ 2σ2 (1−β) (1 +β) 2 −(1 +β−γµ) 2 = (1 +β)γσ 2 (1−β)µ 2(1 +β)−γµ .(173)

  31. [31]

    Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7

    In the stable regime one necessarily has γµ <2(1 +β) , so the denominator in (173) is positive. Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7. We proceed in four steps: (1) existence/uniqueness of a stationary law, (2) characterization via a Lyapunov equation, (3) explicit solution of the(1,1)e...

  32. [32]

    Gaussian-gradient KL bound:from Lemma E.12, we have supu,v∈U DKL(P π u ∥P π v )≤Cµ 2a2T /(σ2/(1−β)),∀π

  33. [33]

    LetV T >0be the variation budget

    Packing size lower bound:from Lemma E.9, we have the packing size lower bound log|U | ≥c J for a universal constantc >0(e.g., from Varshamov–Gilbert or constant-weight codes). LetV T >0be the variation budget. Choose a:= VT C µ J1/q 1/α .(209) ThenGVar p,q(Gu 1:T )≤V T for allu∈ U. Moreover, ifJadditionally satisfies C µ2a2 σ2/(1−β) T≤ c 2 J,(210) then Fa...

  34. [34]

    sub-Gaussian × sub-Gaussian ⇒ sub-exponential

    Subsequently invoking Lemma E.1, we conclude that there does not exist an admissible policy π∈Π β such that supg∈Gp,q(VT ) Rπ,ϕG T (g)≤ 1 9 inf g,˜g∈Θ PT t=1 χ(gt,˜gt). This gives us a lower bound for the minimax regret. Now we must just enforce the gradient variational budget and the information constraint. To enforce the gradient variational budget, we ...

  35. [35]

    for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

    IfG 1 ⊆ G2 and∥X| G 2∥Ψ2 ≤Ka.s. for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

  36. [36]

    for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

    If|Y| ≤Ka.s. for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

  37. [37]

    IfXis independent ofG, then∥X| G∥ Ψα =∥X∥ Ψα a.s

  38. [38]

    IfX⊥Y| G, thenE[XY| G] =E[X| G]E[Y| G]a.s., andE[X|Y,G] =E[X| G]a.s. Proof. (i) By assumption, E[exp(|X/K|2)| G 2]≤2 a.s.; taking G1-conditional expectations and using the tower property yields E[exp(|X/K|2)| G 1]≤2 a.s., hence ∥X| G 1∥Ψ2 ≤K . (ii) If ϕ is admissible for X given G, then Kϕ is admissible for XY given G since |XY|/(Kϕ)≤ |X|/ϕ . Taking essen...

  39. [39]

    If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case

    (Vector).Let X∈R d. If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case. Bye y ≥1 +yfory≥0, 2≥E 1 + |X|2 K2 F F = 1 + 1 K2 F E[|X|2 | F]a.s., soE[|X| 2 | F]≤K 2 F. The vector case follows very similarly. Fix any F-measurable u∈S d−1 and apply the scalar part to...

  40. [40]

    Then E[Mτ] =E[M 0]

    (Unbounded stopping time via truncation).Assume E[|Mτ∧n |]<∞ for all n and that {Mτ∧n }n≥1 is uniformly integrable. Then E[Mτ] =E[M 0]. More generally, if(Mt) is an integrable supermartingale (resp. submartingale), then ineither(i) or (ii) we have E[Mτ]≤ E[M0](resp.E[M τ]≥E[M 0]). Proof.We first show that the stopped process is a martingale: Step 1: Show ...