When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation
Pith reviewed 2026-05-14 18:52 UTC · model grok-4.3
The pith
Optimistic multiplicative weights updates converge linearly in KL divergence to unique interior Nash equilibria in zero-sum games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By viewing the dual iterates of OMWU as an optimistic skew-gradient descent with respect to a specific energy function, the analysis proves that energy dissipates over the iterates. Tight bounds on the magnitude of this dissipation quantify the geometric bottlenecks that arise when primal iterates are close to the simplex boundary. This yields a new linear last-iterate convergence rate in KL divergence on games with a unique and interior Nash equilibrium, with the dependence on game-specific constants shown to be optimal. The geometric view further separates uniform convergence rates across different measures of distance to Nash.
What carries the argument
The energy function on the dual iterates, treated as optimistic skew-gradient descent, whose dissipation magnitude is bounded to quantify convergence slowdowns near the simplex boundary.
If this is right
- Linear last-iterate convergence in KL divergence holds for any game with a unique interior Nash equilibrium.
- The rate's dependence on game-specific constants is tight and cannot be improved in general.
- Uniform best-iterate convergence admits constant lower bounds in KL divergence and total variation distance from Nash.
- Best-iterate convergence in duality gap reaches ilde O(T^{-1/2}) in the 2x2 setting.
- Uniform convergence guarantees do not transfer across different distance measures to Nash.
Where Pith is reading between the lines
- The energy-dissipation lens could be tested on other optimistic online-learning algorithms to see whether similar geometric bottlenecks appear.
- Checking the predicted linear rate numerically on random games with interior equilibria would provide direct validation of the optimality result.
- The separation across distance measures suggests that algorithm designers should select the target metric before claiming uniform convergence.
Load-bearing premise
The chosen energy function must capture the dominant dynamics of the OMWU iterates and remain dissipative under the stated conditions for the linear rate and optimality claims to hold.
What would settle it
A concrete two-player zero-sum game with a unique interior Nash equilibrium in which the last-iterate KL divergence of OMWU fails to converge at a linear rate would falsify the central rate claim.
Figures
read the original abstract
This paper studies the convergence of the Optimistic Multiplicative Weights Update algorithm (OMWU) in two player zero-sum games. Recent works have identified instances on which the last-iterate of OMWU can converge arbitrarily slowly, but understanding when and why this slow convergence occurs has remained open. In this work, we develop a new analysis framework that gives sharp, quantitative explanations for this behavior. Our analysis is based on viewing the algorithm's dual iterates as an optimistic skew-gradient descent with respect to an energy function. We prove over the dual iterates that energy is dissipative, and by establishing tight bounds on the magnitude of dissipation, our analysis quantifies the geometric bottlenecks that arise when the corresponding primal iterates are close to the simplex boundary. This further translates into a new linear last-iterate convergence rate in KL divergence on games with a unique and interior Nash equilibrium. Compared to prior work, this new rate contains a much sharper dependence on game-specific constants, and we prove this dependence is optimal. Moreover, these geometric insights further translate into new separations on uniform convergence rates for OMWU. On the one hand, we prove constant lower bounds on the uniform best-iterate convergence rate in KL divergence and total variation distance from Nash. On the other hand, we establish for the $2\times 2$ setting a new ${\widetilde O}(T^{-1/2})$ best-iterate rate in duality gap, improving substantially over prior work. Together, this shows in general that uniform convergence rate guarantees do not transfer across different measures of distance to Nash.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper develops a geometric analysis framework for the Optimistic Multiplicative Weights Update (OMWU) algorithm in two-player zero-sum games. It models the dual iterates as optimistic skew-gradient descent on a specific energy function, proves that the energy is dissipative, and derives tight bounds on the magnitude of dissipation. These bounds quantify geometric bottlenecks that arise when primal iterates approach the simplex boundary. The framework yields a new linear last-iterate convergence rate in KL divergence for games with a unique interior Nash equilibrium, with a sharper dependence on game constants that is shown to be optimal via matching lower bounds. It further establishes separations in uniform convergence: constant lower bounds on best-iterate rates in KL divergence and total variation distance, contrasted with an improved Õ(T^{-1/2}) best-iterate rate in duality gap for the 2×2 case.
Significance. If the central claims hold, the work supplies the first sharp, quantitative explanation for the slow last-iterate behavior of OMWU and introduces a new dissipation-based tool for analyzing optimistic dynamics. The optimality result and the explicit separation between distance measures to Nash are valuable contributions to the literature on last-iterate convergence and no-regret learning in games. The geometric perspective on boundary effects may generalize to other multiplicative-weights-style algorithms.
major comments (2)
- [§4.2] §4.2, Theorem 4.3: The linear last-iterate KL rate is stated only for unique interior equilibria, yet the dissipation bound in the preceding lemma appears to degrade when any coordinate of the primal iterate approaches zero; the proof must explicitly verify that the interior-NE assumption prevents this degradation throughout the trajectory.
- [Theorem 5.1] Theorem 5.1 (optimality): The matching lower-bound construction is claimed to be optimal in its game-constant dependence, but the reduction from the constructed game to the general case should confirm that the lower-bound instance satisfies the unique-interior-NE hypothesis used in the upper bound.
minor comments (3)
- [§3] The precise definition of the energy function (introduced in §3) should be stated in a single displayed equation rather than assembled from several inline expressions, to facilitate verification of the dissipation identity.
- [Theorem 6.2] In the 2×2 duality-gap rate (Theorem 6.2), the Õ notation hides logarithmic factors whose dependence on the game parameters is not made explicit; adding this dependence would strengthen the comparison with prior work.
- [Figure 2] Figure 2 caption refers to 'energy dissipation curves' but the axes labels and legend are not described in the text; a short sentence clarifying what is plotted would improve readability.
Simulated Author's Rebuttal
Thank you for the thorough review and the recommendation for minor revision. We appreciate the positive evaluation of our contributions. Below we respond to each major comment.
read point-by-point responses
-
Referee: §4.2, Theorem 4.3: The linear last-iterate KL rate is stated only for unique interior equilibria, yet the dissipation bound in the preceding lemma appears to degrade when any coordinate of the primal iterate approaches zero; the proof must explicitly verify that the interior-NE assumption prevents this degradation throughout the trajectory.
Authors: We agree that an explicit verification is needed to ensure the dissipation bound does not degrade. Under the unique interior Nash equilibrium assumption, the last-iterate convergence in KL divergence implies that the iterates remain in a compact set away from the boundary after sufficiently many steps, but to rigorize this from the start, we will add a preliminary bound showing that the minimum probability mass is bounded below by a positive quantity depending on the initial KL divergence and the equilibrium's interior distance. This will be incorporated in the revised proof of Theorem 4.3. revision: yes
-
Referee: Theorem 5.1 (optimality): The matching lower-bound construction is claimed to be optimal in its game-constant dependence, but the reduction from the constructed game to the general case should confirm that the lower-bound instance satisfies the unique-interior-NE hypothesis used in the upper bound.
Authors: The lower bound is constructed explicitly for a game with a unique interior Nash equilibrium (for example, a 2x2 game with equilibrium at the center of the simplex). This instance satisfies the hypothesis of Theorem 4.3, allowing the constants to match directly. We will clarify in the statement of Theorem 5.1 and its proof that the constructed game has a unique interior equilibrium, confirming the optimality within the class of games considered in the upper bound. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper introduces a new analysis framework viewing OMWU dual iterates as optimistic skew-gradient descent on an energy function, then derives dissipation bounds and linear KL convergence directly from the algorithm dynamics and geometric properties of the simplex. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the central claims rest on explicit proofs of dissipation magnitude and matching lower bounds rather than renaming or smuggling prior results. The derivation chain is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math KL divergence and simplex geometry satisfy standard convexity and smoothness properties used in the dissipation analysis
invented entities (1)
-
energy function for OMWU dual iterates
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Energy gradient is globally surjective: First, we show that ∇F : Rm+n → ri(W ) is surjective over the full domain Rm+n. For this, fix w ∈ ri(W ). Recalling that R is the joint negative entropy regularizer, observe that Proposition D.4 shows that∇R(w) = v for some v ∈ Rm+n. Using the definition of∇R, it follows by Proposition D.5 that∇F(v) = ∇F(∇R(w)) = w....
-
[2]
Energy gradient is invariant to constant shifts: Now, again fix w ∈ ri(W ), and let v = ∇R(w) ∈ Rm+n. Further recall from Proposition D.3 that ∇F is invariant to constant shifts, meaning for any s ∈ S that ∇F(v + s) = ∇F(v) = w. Thus to prove ∇F is surjective over Z, it is sufficient to establish v + s ∈ Z for some s ∈ S . As v ∈ Rm+n can lie in the full ...
-
[3]
Sufficient property holds under a unique and interior NE: 27 We will prove that Z + S = Rm+n holds under the assumption of a unique and interior NE w⋆ = ( p⋆, q⋆) ∈ ri(W ). For this, first observe that the property Z + S = Rm+n is equivalent to establishing equality between the orthogonal complements of the two sets. Since Z, S are linear subspaces, we ha...
-
[4]
Existence of minimizer in effective dual space: Recall that F is convex and continuously differentiable. Thus over the linear subspace Z ⊂ Rm+n, the first-order optimality conditions for F give (see, e.g., Boyd and Vandenberghe (2004), Sec. 4.2.3): z ∈ argminz′∈Z F(z′) ⇐ ⇒ ⟨∇ F(z), z′′⟩ = 0 for all z′′ ∈ Z . Now due to Proposition D.9, ∇F : Z → ri(W ) is ...
work page 2004
-
[5]
Minimum energy function value attained: Recall from expression (32) in the proof of Proposition D.10 that Z ∩ S = Span(Jw ⋆). Thus fixing z⋆ ∈ argminz∈Z F(z) from Step (1), then also zτ = z⋆ + τJw ⋆ ∈ argminz∈Z F(z) for any τ ∈ R. This is because ∇F(z⋆ + s) = ∇F(z⋆) for any s ∈ S , and since Jw ⋆ ∈ S . Thus z′ ∈ Z is in the argmin set if and only if z′ = ...
-
[6]
Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆
Apply Fenchel-Young identity: Now by Part (ii) of the Fenchel-Young identity from Proposition D.2, we have for all z ∈ Z and w = ∇F(z) that KL(w⋆, w) = R(w⋆) + F(z) − ⟨w⋆, z⟩ = R(w⋆) + F(z) , (33) where the second equality is due to Proposition D.6. Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆. Then it follows from (33) that 0 = KL(w⋆, w⋆) = R(w⋆) + F(z⋆) = ⇒ R(w...
work page 2004
-
[7]
Exact expansion of change in energy: The starting point is to take a first-order Taylor expansion of F(zt+1) around zt. Using the structure of the (OMWU Dual) update, as well as the Hessian-based approximation of gradient differences from Proposition E.4, we derive an exact expansion of ∆F(zt) = F(zt+1) − F(zt), the one-step change in energy over the dual...
-
[8]
Controlling the error terms: Notice that the first term of (53) is exactly the desired dissipation term −η2∥J∇F(zt)∥2 zt (up to a constant factor) appearing in the statement of the lemma. Thus, the remaining technical challenge is to ensure the final three terms of (53) also scale at most like O(η2∥J∇F(zt)∥2 zt ), with a constant factor that can be made l...
-
[9]
Thus for all such L ≥ 72, it holds that 18 25 + 36 exp(6/L) 25L ≤ 149
-
[10]
Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ
■ For the inner product involving GF(zt, zt−1), we derive a similar bound: Proposition F.8. Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ . Then for t ≥ 1: η⟨J∇F(zt), GF(zt, zt−1)⟩ ≤ 18 5 · ν(6ησ) · η2∥J∇F(zt)∥2 zt . Moreover, if η ≤ 1 Lσ for L ≥ 72, then 18 25 · ν(6ησ) ≤ 54 5L + 108 exp(6/L) 5L2 ≤ 31 200. Proof. Applying the generalized Cauchy-S...
-
[11]
Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2
□ Using this new notation and relationships between constants, we now give the overall template for proving claim (i) of the proposition: General template for proving (i). Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2. For this, observe by the triangle inequality that ∥∆t−1∥zt = ∥∆t − (∆t − ∆t−1)∥zt ≤ ∥ ∆t∥zt + ∥∆t − ∆t...
-
[12]
Non-uniform restricted strong convexity. Using the definition of the minimum restricted eigenvalue M(z) from Definition G.1, the inequality of Proposition G.2, and the characterization of M(z) from Proposition G.3, we have ∥J∇F(z)∥2 z ≥ M(z) · ∥ΠS ⊥ J∇F(z) ∥2 2 ≥ wmin · ∥ΠS ⊥ J∇F(z) ∥2 2
-
[13]
Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9)
Invariance to Nash and restricted spectrum of J. Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9). Now by Proposi- tion B.1, we have J∇F(z⋆) = Jw ⋆ ∈ S . Thus by definition of the projectionΠS ⊥, we have ΠS ⊥ (J∇F(z⋆)) = 0, and therefore by linearity ΠS ⊥ (J∇F(z)) = ΠS ⊥ (J(∇F(z) − ∇F(z⋆))) . Moreover, as w, w⋆ ∈ W , ...
-
[14]
We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆)
Bounding the dual suboptimality gap. We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆). We proceed in two steps. First, we have from Lemma G.10 the dual relationship between KL(w⋆, w) and χ2(w⋆, w), which gives F(z) − F(z⋆) ≤ ∥∇ F(z) − ∇F(z⋆)∥2 z,∗ . Next, by the restricted local smoothness properties of ...
-
[15]
Equivalence between primal and dual suboptimality gaps. Combining (106) from Step 2 and (107) from Step 3 yields ∥J∇F(z)∥2 z ≥ σ2 min · w2 min · (F(z) − F(z⋆)) . (108) This is exactly a non-uniform skew-gradient-domination property for the energy function F. To relate ∥J∇F(z)∥2 z back to the primal space, we apply the equivalence of Proposition D.11 (also...
-
[16]
Following identical calculations, we similarly conclude Varq(A⊤ p) ≥ σ2 min,m · qmin · ∥p − p⋆∥2 2. Proof of part (iii). Using the relationship KL(p⋆, p) ≤ χ2(p⋆, p), we can write and further bound KL(p⋆, p) ≤ χ2(p⋆, p) = ∑ i∈[m] (p(i) − p⋆(i))2 p(i) ≤ 1 pmin · ∑ i∈[m] (p(i) − p⋆(i))2 = 1 pmin · ∥p − p⋆∥2 2 . Rearranging yields ∥p − p⋆∥2 2 ≥ pmin · KL(p⋆,...
-
[17]
■ H Details on Universal Last-Iterate Convergence in KL This section gives the proofs of Theorem 4.2 and Theorem 4.7, which establish asymptotic last-iterate convergence and a linear last-iterate convergence rate in KL divergence, respectively. Oranization of section. The section is organized as follows: • Section H.1 first proves a helper lemma that esta...
work page 2021
-
[18]
Then substituting (120) into (119) and rearranging further gives KL(w⋆, w1) ≤ KL(w⋆, w0) · 1 + 1 4 , which completes the proof. ■ 57 H.2 Proof of Theorem 4.2 – Asymptotic Last-Iterate Convergence We now give a proof of asymptotic last-iterate convergence to a unique and interior NE. We first restate the theorem: Theorem 4.2 (Asymptotic Last-Iterate Conver...
-
[19]
Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0)
Primal iterates lie in compact sublevelset: Combining Corollary D.7 and the upper bound of Lemma 4.1, we have for all t ≥ 1: KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) ≤ − 1 20 η2∥J∇F(zt)∥2 zt ≤ 0 . Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0). Together, this means thatKL(w⋆, wt) ≤ (5/4) · KL(w⋆, w0) for all t ≥ 0. Letting κ = (5/4) · K...
-
[20]
Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p)
Dissipation term is zero only at Nash: For z ∈ Rm+n, let w = ( p, q) = ∇F(z) ∈ ri(W ). Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p). By definition of variance (see expression (35)), note that Varp(v) = 0 and Varq(u) = 0 if and only if v = c1m and u = d1n for some c, d ∈ R. Together, this implies that ∥J∇F(z)∥2 z = 0 if ...
-
[21]
Subsequence and global convergence: As all wt ∈ U for the compact set U, we have by the Bolzano-Weierstrass theorem that every infinite subsequence of {wt} has at least one limit pointw∞ ∈ U . Moreover, by the continuity ofD(·), if wtk → w∞ for some subsequence {wtk }, then also D(wtk ) → D(w∞). Now as U is compact, (121) implies that KL(w⋆, wtk ) converg...
-
[22]
Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt)
Change in KL is change in energy: First, let {zt} denote the dual OMWU iterates. Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) . (122)
-
[23]
Upper bound on energy dissipation: From Lemma 4.1, we have for all t ≥ 1 under the constraint on η that F(zt+1) − F(zt) ≤ − 1 20 · η2∥J∇F(zt)∥2 zt . (123)
-
[24]
Non-uniform skew-gradient domination: By Proposition 4.4, we further have the lower bound ∥J∇F(zt)∥2 zt ≥ σ2 min · w2 t,min · KL(w⋆, wt) , (124) where σmin > 0 under the unique and interior Nash equilibrium assumption. Combining expressions (122), (123), and (124), we find for all t ≥ 1 KL(w⋆, wt+1) − KL(w⋆, wt) ≤ − 1 20 η2 · σ2 minw2 t,min · KL(w⋆, wt) ....
work page 2021
-
[25]
dual" perspective) and Proposition G.13 (proven using a slightly more involved “primal
Then combin- ing (128) and (129), we have p⋆(imin) log 1 pt,min ≤ cΛ =⇒ log 1 pt,min ≤ cΛ p⋆(imin) ≤ cΛ δp ≤ cΛ δ , where the final two inequalities are due to p⋆(imin) ≥ δp ≥ δ. Rearranging, we find pt,min ≥ exp −cΛ δ . (130) By an identical calculation, we also have qt,min ≥ exp( −cΛ δ ). As wt,min = min{pt,min, qt,min}, it follows that wt,min ≥ exp( −c...
work page 2021
-
[26]
Proposition I.5 extends this result to hold for general δp, δq ∈ (0, 1). Proof. First, let σmin,n and σmin,m be the component-wise restricted singular values from (110), and recall from Proposition G.11 that σmin = min{σmin(A, 1⊥), σmin(A⊤, 1⊥)}. Using calculations similar to the proof 65 of Part (2) of Proposition G.16, we will show that σmin(A, 1⊥) = σm...
-
[27]
For this, recall by definition that σmin(A, 1⊥) = infv∈1⊥\{0} ∥Π1⊥ (Av)∥2 ∥v∥2 . Moreover, for any v ∈ 1⊥ ⊂ R2, it follows that v = c · (−1, 1) for some c ∈ R. Then using the definition of A = Aδp,δq, and recalling that Π1⊥ = I − 1 2 11⊥ ∈ R2×2, it follows by a direct calculation that Π1⊥ (Av) = c · δp − (δp − (1 − δp))/2 −(1 − δp) − (δp − (1 − δp))/2 = c...
work page 2024
-
[28]
Let p = ( p(1), p(2)) and q = (q(1), q(2))
Exact Characterization of Variance in 2 × 2 Setting. Let p = ( p(1), p(2)) and q = (q(1), q(2)). We first prove that Varp(Aq) ≤ 2 · p(1) · p(2) · ∥A∥2 2 · ∥q − q⋆∥2 2 (146) and Var q(A⊤ p) ≤ 2 · q(1) · q(2) · ∥A∥2 2 · ∥p − p⋆∥2 2 . (147) To prove (146), let v = Aq ∈ R2. By definition of Varp(v) from (35), it is straightforward to compute in the two-dimens...
-
[29]
Relating KL to Euclidean Distance to Nash. Next, we prove the following relationships between KL and euclidean distance to Nash: ∥p − p⋆∥2 2 ≤ 4 · max(pmin, δp) · KL(p⋆, p) (148) and ∥q − q⋆∥2 2 ≤ 4 · max(qmin, δq) · KL(q⋆, q) . (149) We start by proving (148). For this, recall from Section 3 that for w = ( p, q) ∈ W , we write R(w) = Rm(q) + Rn(q) to den...
work page 2006
-
[30]
Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp)
In particular, this also implies that ps(2) = min(ps(1), ps(2)) ≤ 1 2 for all s ∈ [0, 1]. Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp) . Substituting this uniform bound into (151), and using the fact that R 1 0 (1 − s)ds = 1 2, we find KL(p⋆, p) ≥ 1 4 · ∥p − p⋆∥2 2 · 1 max(pmin, δp) . Rearranging then ...
-
[31]
Combining the Pieces. Combining the inequalities of (146) and (147) from Step 1 and expressions (148) and (149) from Step 2, we find Varp(Aq) ≤ 8 · ∥A∥2 2 · pmin · max(qmin, δ) · KL(q⋆, q) and Var q(A⊤ p) ≤ 8 · ∥A∥2 2 · qmin · max(pmin, δ) · KL(p⋆, p) . (152) Here, we additionally used the fact that p(1)p(2) = ( 1 − p(2))p(2) ≤ p(2) = pmin and similarly q...
work page 2025
-
[32]
Otherwise, if δp, δq ≥ 1 10, then the universal last-iterate bound of Theorem 4.7, together with Corollary B.6, would directly give KL(w⋆, wT) ≤ O(T−1/2) for all T ≥ 1. 82 Assumptions on Initialization. We assume that the initialization w0 = ( p0, q0) ∈ ri(W ) is the uniform distribution p0 = (1/2, 1/2) and q0 = (1/2, 1/2). Considering Canonical 2 × 2 Mat...
work page 2025
-
[33]
Let A := Aδp,δq ∈ [−1, 1]2×2 be the matrix from Definition I.1. Let {wt} and {zt} be the primal and dual iterates of OMWU for η satisfying Assumption 2 and initialized from the uniform distribution. For each t ≥ 0, let zt = ( xt, yt) and let wt = ( pt, qt). Then for all t ≥ 0, the following bounds hold: (i) F (zt) < F(zt−1) < · · · < F(z1) ≤ 5 4 · F(z0) ≤...
-
[34]
Then by the universal one-step multiplicative change in KL divergence from Theorem 4.7 (and in particular, using the expression from Corollary H.3, recall that for every such t where pt,min ≥ δp/3 and qt,min ≥ δq/3, we have KL(w⋆, wt+1) ≤ KL(w⋆, wt) · exp − 1 20 η2σ2 min · pt,min · qt,min ≤ KL(w⋆, wt) · exp − 1 720 η2δpδq . By definition of Tstartup, Prop...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.