When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation

Anas Barakat; Andre Wibisono; Antonios Varvitsiotis; Georgios Piliouras; John Lazarsfeld

arxiv: 2605.13242 · v1 · pith:KDYMTDICnew · submitted 2026-05-13 · 💻 cs.GT · cs.LG

When and Why is Optimistic Multiplicative Weights Slow? The Geometry of Energy Dissipation

John Lazarsfeld , Anas Barakat , Georgios Piliouras , Antonios Varvitsiotis , Andre Wibisono This is my paper

Pith reviewed 2026-05-14 18:52 UTC · model grok-4.3

classification 💻 cs.GT cs.LG

keywords optimistic multiplicative weights updatelast-iterate convergencezero-sum gamesenergy dissipationKL divergencesimplex boundaryNash equilibriumuniform convergence

0 comments

The pith

Optimistic multiplicative weights updates converge linearly in KL divergence to unique interior Nash equilibria in zero-sum games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the Optimistic Multiplicative Weights Update algorithm in two-player zero-sum games by modeling its dual iterates as optimistic skew-gradient descent on an energy function. It proves that this energy dissipates and derives tight bounds on the dissipation to identify geometric bottlenecks that appear when the corresponding primal iterates approach the simplex boundary. These bounds yield a linear last-iterate convergence rate in KL divergence for games possessing a unique interior Nash equilibrium, and the authors establish that the dependence on game constants is optimal. The same framework also produces constant lower bounds on uniform best-iterate rates in KL divergence and total variation while giving an improved best-iterate rate in duality gap for 2x2 games, demonstrating that uniform convergence guarantees fail to transfer across distance measures to Nash.

Core claim

By viewing the dual iterates of OMWU as an optimistic skew-gradient descent with respect to a specific energy function, the analysis proves that energy dissipates over the iterates. Tight bounds on the magnitude of this dissipation quantify the geometric bottlenecks that arise when primal iterates are close to the simplex boundary. This yields a new linear last-iterate convergence rate in KL divergence on games with a unique and interior Nash equilibrium, with the dependence on game-specific constants shown to be optimal. The geometric view further separates uniform convergence rates across different measures of distance to Nash.

What carries the argument

The energy function on the dual iterates, treated as optimistic skew-gradient descent, whose dissipation magnitude is bounded to quantify convergence slowdowns near the simplex boundary.

If this is right

Linear last-iterate convergence in KL divergence holds for any game with a unique interior Nash equilibrium.
The rate's dependence on game-specific constants is tight and cannot be improved in general.
Uniform best-iterate convergence admits constant lower bounds in KL divergence and total variation distance from Nash.
Best-iterate convergence in duality gap reaches ilde O(T^{-1/2}) in the 2x2 setting.
Uniform convergence guarantees do not transfer across different distance measures to Nash.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The energy-dissipation lens could be tested on other optimistic online-learning algorithms to see whether similar geometric bottlenecks appear.
Checking the predicted linear rate numerically on random games with interior equilibria would provide direct validation of the optimality result.
The separation across distance measures suggests that algorithm designers should select the target metric before claiming uniform convergence.

Load-bearing premise

The chosen energy function must capture the dominant dynamics of the OMWU iterates and remain dissipative under the stated conditions for the linear rate and optimality claims to hold.

What would settle it

A concrete two-player zero-sum game with a unique interior Nash equilibrium in which the last-iterate KL divergence of OMWU fails to converge at a linear rate would falsify the central rate claim.

Figures

Figures reproduced from arXiv: 2605.13242 by Anas Barakat, Andre Wibisono, Antonios Varvitsiotis, Georgios Piliouras, John Lazarsfeld.

**Figure 1.** Figure 1: The primal and dual iterates of OMWU on two instances of a 2 × 2 zero-sum game. The algorithm is run for T = 1500 iterations with stepsize η = 0.2 in both instances. The Nash equilibrium (and corresponding dual equilibrium point) is denoted by the blue star; the OMWU initializations by the green circle; the T’th primal and dual iterates by the red square. In the left instance, the Nash equilibrium (NE) is … view at source ↗

**Figure 2.** Figure 2: Dependence of δ in linear rate. In [PITH_FULL_IMAGE:figures/full_fig_p062_2.png] view at source ↗

**Figure 3.** Figure 3: Plot of KL(w ⋆ , wt) of OMWU in log-scale over time on instantiations of 2x2 and 10x10 payoff matrices with varying values of δ. The minimum NE coordinate is δ for the 2x2 instances and δ/5 for the 10x10 instances. Observe that the one-step change in KL decays non-uniformly over time, as suggested by Part (1) of Theorem 4.7. I Details on OMWU Dynamics in 2x2 Setting This section gives additional preliminar… view at source ↗

**Figure 4.** Figure 4: Uniform NE. Here, p ⋆ = q ⋆ = (0.5, 0.5) [PITH_FULL_IMAGE:figures/full_fig_p072_4.png] view at source ↗

**Figure 5.** Figure 5: Boundary NE in one component. Here, p ⋆ = (0.9, 0.1) and q ⋆ = (0.5, 0.5) [PITH_FULL_IMAGE:figures/full_fig_p072_5.png] view at source ↗

**Figure 6.** Figure 6: DG(w), TV(w ⋆ , w), and KL(w ⋆ , w) levelsets in the 2 × 2 setting for p ⋆ = q ⋆ = (0.9, 0.1). Discussion. Observe that the geometry of the levelsets of each distance function is highly dependent on the location of the Nash equilibrium w ⋆ . In particular, when w ⋆ is closer to the simplex boundary ( [PITH_FULL_IMAGE:figures/full_fig_p073_6.png] view at source ↗

read the original abstract

This paper studies the convergence of the Optimistic Multiplicative Weights Update algorithm (OMWU) in two player zero-sum games. Recent works have identified instances on which the last-iterate of OMWU can converge arbitrarily slowly, but understanding when and why this slow convergence occurs has remained open. In this work, we develop a new analysis framework that gives sharp, quantitative explanations for this behavior. Our analysis is based on viewing the algorithm's dual iterates as an optimistic skew-gradient descent with respect to an energy function. We prove over the dual iterates that energy is dissipative, and by establishing tight bounds on the magnitude of dissipation, our analysis quantifies the geometric bottlenecks that arise when the corresponding primal iterates are close to the simplex boundary. This further translates into a new linear last-iterate convergence rate in KL divergence on games with a unique and interior Nash equilibrium. Compared to prior work, this new rate contains a much sharper dependence on game-specific constants, and we prove this dependence is optimal. Moreover, these geometric insights further translate into new separations on uniform convergence rates for OMWU. On the one hand, we prove constant lower bounds on the uniform best-iterate convergence rate in KL divergence and total variation distance from Nash. On the other hand, we establish for the $2\times 2$ setting a new ${\widetilde O}(T^{-1/2})$ best-iterate rate in duality gap, improving substantially over prior work. Together, this shows in general that uniform convergence rate guarantees do not transfer across different measures of distance to Nash.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OMWU slow last-iterate behavior gets a clean geometric explanation via energy dissipation, plus an optimal linear KL rate for interior unique Nash and some measure-specific separations.

read the letter

The core contribution is a new energy-dissipation lens on OMWU that pins down exactly why last-iterate convergence can drag near the simplex boundary and then converts that into a linear KL rate whose dependence on game constants is shown tight for the interior unique-Nash case. That rate improves on earlier bounds and the optimality argument via matching lower bounds is the part that actually moves the needle for algorithm design. The same geometric view also produces clean separations: constant lower bounds on uniform convergence in KL and total variation, but a faster Õ(T^{-1/2}) best-iterate duality-gap rate in 2x2 games. Those distinctions are useful because they show uniform guarantees do not transfer across distances. The framework treats the dual iterates as optimistic skew-gradient descent on a specific energy and derives dissipation bounds directly from the dynamics rather than fitting them post hoc. That approach looks internally consistent and avoids the circularity that sometimes appears in rate analyses. The main soft spot is that the energy function is tailored to OMWU; it is not obvious how far the same construction travels to other optimistic methods or to settings without a unique interior equilibrium. The abstract claims the proofs are complete, but without seeing the explicit energy definition and the tight-bound derivations it is hard to judge whether the dissipation magnitude is controlled in all regimes or only under the stated step-size and initialization conditions. Still, the geometric story matches the observed slow cases and the interior linear rate is a concrete advance. This paper is for people working on last-iterate convergence in no-regret dynamics and equilibrium computation. It deserves a serious referee because the optimality claim and the measure separations are falsifiable and the analysis supplies a reusable tool rather than just another rate. I would send it out for review.

Referee Report

2 major / 3 minor

Summary. This paper develops a geometric analysis framework for the Optimistic Multiplicative Weights Update (OMWU) algorithm in two-player zero-sum games. It models the dual iterates as optimistic skew-gradient descent on a specific energy function, proves that the energy is dissipative, and derives tight bounds on the magnitude of dissipation. These bounds quantify geometric bottlenecks that arise when primal iterates approach the simplex boundary. The framework yields a new linear last-iterate convergence rate in KL divergence for games with a unique interior Nash equilibrium, with a sharper dependence on game constants that is shown to be optimal via matching lower bounds. It further establishes separations in uniform convergence: constant lower bounds on best-iterate rates in KL divergence and total variation distance, contrasted with an improved Õ(T^{-1/2}) best-iterate rate in duality gap for the 2×2 case.

Significance. If the central claims hold, the work supplies the first sharp, quantitative explanation for the slow last-iterate behavior of OMWU and introduces a new dissipation-based tool for analyzing optimistic dynamics. The optimality result and the explicit separation between distance measures to Nash are valuable contributions to the literature on last-iterate convergence and no-regret learning in games. The geometric perspective on boundary effects may generalize to other multiplicative-weights-style algorithms.

major comments (2)

[§4.2] §4.2, Theorem 4.3: The linear last-iterate KL rate is stated only for unique interior equilibria, yet the dissipation bound in the preceding lemma appears to degrade when any coordinate of the primal iterate approaches zero; the proof must explicitly verify that the interior-NE assumption prevents this degradation throughout the trajectory.
[Theorem 5.1] Theorem 5.1 (optimality): The matching lower-bound construction is claimed to be optimal in its game-constant dependence, but the reduction from the constructed game to the general case should confirm that the lower-bound instance satisfies the unique-interior-NE hypothesis used in the upper bound.

minor comments (3)

[§3] The precise definition of the energy function (introduced in §3) should be stated in a single displayed equation rather than assembled from several inline expressions, to facilitate verification of the dissipation identity.
[Theorem 6.2] In the 2×2 duality-gap rate (Theorem 6.2), the Õ notation hides logarithmic factors whose dependence on the game parameters is not made explicit; adding this dependence would strengthen the comparison with prior work.
[Figure 2] Figure 2 caption refers to 'energy dissipation curves' but the axes labels and legend are not described in the text; a short sentence clarifying what is plotted would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review and the recommendation for minor revision. We appreciate the positive evaluation of our contributions. Below we respond to each major comment.

read point-by-point responses

Referee: §4.2, Theorem 4.3: The linear last-iterate KL rate is stated only for unique interior equilibria, yet the dissipation bound in the preceding lemma appears to degrade when any coordinate of the primal iterate approaches zero; the proof must explicitly verify that the interior-NE assumption prevents this degradation throughout the trajectory.

Authors: We agree that an explicit verification is needed to ensure the dissipation bound does not degrade. Under the unique interior Nash equilibrium assumption, the last-iterate convergence in KL divergence implies that the iterates remain in a compact set away from the boundary after sufficiently many steps, but to rigorize this from the start, we will add a preliminary bound showing that the minimum probability mass is bounded below by a positive quantity depending on the initial KL divergence and the equilibrium's interior distance. This will be incorporated in the revised proof of Theorem 4.3. revision: yes
Referee: Theorem 5.1 (optimality): The matching lower-bound construction is claimed to be optimal in its game-constant dependence, but the reduction from the constructed game to the general case should confirm that the lower-bound instance satisfies the unique-interior-NE hypothesis used in the upper bound.

Authors: The lower bound is constructed explicitly for a game with a unique interior Nash equilibrium (for example, a 2x2 game with equilibrium at the center of the simplex). This instance satisfies the hypothesis of Theorem 4.3, allowing the constants to match directly. We will clarify in the statement of Theorem 5.1 and its proof that the constructed game has a unique interior equilibrium, confirming the optimality within the class of games considered in the upper bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces a new analysis framework viewing OMWU dual iterates as optimistic skew-gradient descent on an energy function, then derives dissipation bounds and linear KL convergence directly from the algorithm dynamics and geometric properties of the simplex. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the central claims rest on explicit proofs of dissipation magnitude and matching lower bounds rather than renaming or smuggling prior results. The derivation chain is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The analysis introduces an energy function whose dissipation properties are central; no free parameters are mentioned. Standard mathematical assumptions on the simplex, KL divergence, and gradient descent are used.

axioms (1)

standard math KL divergence and simplex geometry satisfy standard convexity and smoothness properties used in the dissipation analysis
Invoked implicitly when treating dual iterates as skew-gradient descent

invented entities (1)

energy function for OMWU dual iterates no independent evidence
purpose: to quantify dissipation and geometric bottlenecks near simplex boundary
New modeling device introduced to explain slow convergence; no independent falsifiable prediction outside the convergence claims is stated in the abstract

pith-pipeline@v0.9.0 · 5602 in / 1493 out tokens · 77197 ms · 2026-05-14T18:52:51.779649+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1]

For this, fix w ∈ ri(W )

Energy gradient is globally surjective: First, we show that ∇F : Rm+n → ri(W ) is surjective over the full domain Rm+n. For this, fix w ∈ ri(W ). Recalling that R is the joint negative entropy regularizer, observe that Proposition D.4 shows that∇R(w) = v for some v ∈ Rm+n. Using the definition of∇R, it follows by Proposition D.5 that∇F(v) = ∇F(∇R(w)) = w....

work page
[2]

Further recall from Proposition D.3 that ∇F is invariant to constant shifts, meaning for any s ∈ S that ∇F(v + s) = ∇F(v) = w

Energy gradient is invariant to constant shifts: Now, again fix w ∈ ri(W ), and let v = ∇R(w) ∈ Rm+n. Further recall from Proposition D.3 that ∇F is invariant to constant shifts, meaning for any s ∈ S that ∇F(v + s) = ∇F(v) = w. Thus to prove ∇F is surjective over Z, it is sufficient to establish v + s ∈ Z for some s ∈ S . As v ∈ Rm+n can lie in the full ...

work page
[3]

For this, first observe that the property Z + S = Rm+n is equivalent to establishing equality between the orthogonal complements of the two sets

Sufficient property holds under a unique and interior NE: 27 We will prove that Z + S = Rm+n holds under the assumption of a unique and interior NE w⋆ = ( p⋆, q⋆) ∈ ri(W ). For this, first observe that the property Z + S = Rm+n is equivalent to establishing equality between the orthogonal complements of the two sets. Since Z, S are linear subspaces, we ha...

work page
[4]

Thus over the linear subspace Z ⊂ Rm+n, the first-order optimality conditions for F give (see, e.g., Boyd and Vandenberghe (2004), Sec

Existence of minimizer in effective dual space: Recall that F is convex and continuously differentiable. Thus over the linear subspace Z ⊂ Rm+n, the first-order optimality conditions for F give (see, e.g., Boyd and Vandenberghe (2004), Sec. 4.2.3): z ∈ argminz′∈Z F(z′) ⇐ ⇒ ⟨∇ F(z), z′′⟩ = 0 for all z′′ ∈ Z . Now due to Proposition D.9, ∇F : Z → ri(W ) is ...

work page 2004
[5]

Thus fixing z⋆ ∈ argminz∈Z F(z) from Step (1), then also zτ = z⋆ + τJw ⋆ ∈ argminz∈Z F(z) for any τ ∈ R

Minimum energy function value attained: Recall from expression (32) in the proof of Proposition D.10 that Z ∩ S = Span(Jw ⋆). Thus fixing z⋆ ∈ argminz∈Z F(z) from Step (1), then also zτ = z⋆ + τJw ⋆ ∈ argminz∈Z F(z) for any τ ∈ R. This is because ∇F(z⋆ + s) = ∇F(z⋆) for any s ∈ S , and since Jw ⋆ ∈ S . Thus z′ ∈ Z is in the argmin set if and only if z′ = ...

work page
[6]

Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆

Apply Fenchel-Young identity: Now by Part (ii) of the Fenchel-Young identity from Proposition D.2, we have for all z ∈ Z and w = ∇F(z) that KL(w⋆, w) = R(w⋆) + F(z) − ⟨w⋆, z⟩ = R(w⋆) + F(z) , (33) where the second equality is due to Proposition D.6. Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆. Then it follows from (33) that 0 = KL(w⋆, w⋆) = R(w⋆) + F(z⋆) = ⇒ R(w...

work page 2004
[7]

Exact expansion of change in energy: The starting point is to take a first-order Taylor expansion of F(zt+1) around zt. Using the structure of the (OMWU Dual) update, as well as the Hessian-based approximation of gradient differences from Proposition E.4, we derive an exact expansion of ∆F(zt) = F(zt+1) − F(zt), the one-step change in energy over the dual...

work page
[8]

Controlling the error terms: Notice that the first term of (53) is exactly the desired dissipation term −η2∥J∇F(zt)∥2 zt (up to a constant factor) appearing in the statement of the lemma. Thus, the remaining technical challenge is to ensure the final three terms of (53) also scale at most like O(η2∥J∇F(zt)∥2 zt ), with a constant factor that can be made l...

work page
[9]

Thus for all such L ≥ 72, it holds that 18 25 + 36 exp(6/L) 25L ≤ 149

work page
[10]

Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ

■ For the inner product involving GF(zt, zt−1), we derive a similar bound: Proposition F.8. Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ . Then for t ≥ 1: η⟨J∇F(zt), GF(zt, zt−1)⟩ ≤ 18 5 · ν(6ησ) · η2∥J∇F(zt)∥2 zt . Moreover, if η ≤ 1 Lσ for L ≥ 72, then 18 25 · ν(6ησ) ≤ 54 5L + 108 exp(6/L) 5L2 ≤ 31 200. Proof. Applying the generalized Cauchy-S...

work page
[11]

Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2

□ Using this new notation and relationships between constants, we now give the overall template for proving claim (i) of the proposition: General template for proving (i). Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2. For this, observe by the triangle inequality that ∥∆t−1∥zt = ∥∆t − (∆t − ∆t−1)∥zt ≤ ∥ ∆t∥zt + ∥∆t − ∆t...

work page
[12]

Non-uniform restricted strong convexity. Using the definition of the minimum restricted eigenvalue M(z) from Definition G.1, the inequality of Proposition G.2, and the characterization of M(z) from Proposition G.3, we have ∥J∇F(z)∥2 z ≥ M(z) · ∥ΠS ⊥ J∇F(z) ∥2 2 ≥ wmin · ∥ΠS ⊥ J∇F(z) ∥2 2

work page
[13]

Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9)

Invariance to Nash and restricted spectrum of J. Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9). Now by Proposi- tion B.1, we have J∇F(z⋆) = Jw ⋆ ∈ S . Thus by definition of the projectionΠS ⊥, we have ΠS ⊥ (J∇F(z⋆)) = 0, and therefore by linearity ΠS ⊥ (J∇F(z)) = ΠS ⊥ (J(∇F(z) − ∇F(z⋆))) . Moreover, as w, w⋆ ∈ W , ...

work page
[14]

We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆)

Bounding the dual suboptimality gap. We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆). We proceed in two steps. First, we have from Lemma G.10 the dual relationship between KL(w⋆, w) and χ2(w⋆, w), which gives F(z) − F(z⋆) ≤ ∥∇ F(z) − ∇F(z⋆)∥2 z,∗ . Next, by the restricted local smoothness properties of ...

work page
[15]

Combining (106) from Step 2 and (107) from Step 3 yields ∥J∇F(z)∥2 z ≥ σ2 min · w2 min · (F(z) − F(z⋆))

Equivalence between primal and dual suboptimality gaps. Combining (106) from Step 2 and (107) from Step 3 yields ∥J∇F(z)∥2 z ≥ σ2 min · w2 min · (F(z) − F(z⋆)) . (108) This is exactly a non-uniform skew-gradient-domination property for the energy function F. To relate ∥J∇F(z)∥2 z back to the primal space, we apply the equivalence of Proposition D.11 (also...

work page
[16]

Proof of part (iii)

Following identical calculations, we similarly conclude Varq(A⊤ p) ≥ σ2 min,m · qmin · ∥p − p⋆∥2 2. Proof of part (iii). Using the relationship KL(p⋆, p) ≤ χ2(p⋆, p), we can write and further bound KL(p⋆, p) ≤ χ2(p⋆, p) = ∑ i∈[m] (p(i) − p⋆(i))2 p(i) ≤ 1 pmin · ∑ i∈[m] (p(i) − p⋆(i))2 = 1 pmin · ∥p − p⋆∥2 2 . Rearranging yields ∥p − p⋆∥2 2 ≥ pmin · KL(p⋆,...

work page
[17]

Oranization of section

■ H Details on Universal Last-Iterate Convergence in KL This section gives the proofs of Theorem 4.2 and Theorem 4.7, which establish asymptotic last-iterate convergence and a linear last-iterate convergence rate in KL divergence, respectively. Oranization of section. The section is organized as follows: • Section H.1 first proves a helper lemma that esta...

work page 2021
[18]

■ 57 H.2 Proof of Theorem 4.2 – Asymptotic Last-Iterate Convergence We now give a proof of asymptotic last-iterate convergence to a unique and interior NE

Then substituting (120) into (119) and rearranging further gives KL(w⋆, w1) ≤ KL(w⋆, w0) · 1 + 1 4 , which completes the proof. ■ 57 H.2 Proof of Theorem 4.2 – Asymptotic Last-Iterate Convergence We now give a proof of asymptotic last-iterate convergence to a unique and interior NE. We first restate the theorem: Theorem 4.2 (Asymptotic Last-Iterate Conver...

work page
[19]

Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0)

Primal iterates lie in compact sublevelset: Combining Corollary D.7 and the upper bound of Lemma 4.1, we have for all t ≥ 1: KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) ≤ − 1 20 η2∥J∇F(zt)∥2 zt ≤ 0 . Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0). Together, this means thatKL(w⋆, wt) ≤ (5/4) · KL(w⋆, w0) for all t ≥ 0. Letting κ = (5/4) · K...

work page
[20]

Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p)

Dissipation term is zero only at Nash: For z ∈ Rm+n, let w = ( p, q) = ∇F(z) ∈ ri(W ). Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p). By definition of variance (see expression (35)), note that Varp(v) = 0 and Varq(u) = 0 if and only if v = c1m and u = d1n for some c, d ∈ R. Together, this implies that ∥J∇F(z)∥2 z = 0 if ...

work page
[21]

Moreover, by the continuity ofD(·), if wtk → w∞ for some subsequence {wtk }, then also D(wtk ) → D(w∞)

Subsequence and global convergence: As all wt ∈ U for the compact set U, we have by the Bolzano-Weierstrass theorem that every infinite subsequence of {wt} has at least one limit pointw∞ ∈ U . Moreover, by the continuity ofD(·), if wtk → w∞ for some subsequence {wtk }, then also D(wtk ) → D(w∞). Now as U is compact, (121) implies that KL(w⋆, wtk ) converg...

work page
[22]

Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt)

Change in KL is change in energy: First, let {zt} denote the dual OMWU iterates. Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) . (122)

work page
[23]

Upper bound on energy dissipation: From Lemma 4.1, we have for all t ≥ 1 under the constraint on η that F(zt+1) − F(zt) ≤ − 1 20 · η2∥J∇F(zt)∥2 zt . (123)

work page
[24]

Combining expressions (122), (123), and (124), we find for all t ≥ 1 KL(w⋆, wt+1) − KL(w⋆, wt) ≤ − 1 20 η2 · σ2 minw2 t,min · KL(w⋆, wt)

Non-uniform skew-gradient domination: By Proposition 4.4, we further have the lower bound ∥J∇F(zt)∥2 zt ≥ σ2 min · w2 t,min · KL(w⋆, wt) , (124) where σmin > 0 under the unique and interior Nash equilibrium assumption. Combining expressions (122), (123), and (124), we find for all t ≥ 1 KL(w⋆, wt+1) − KL(w⋆, wt) ≤ − 1 20 η2 · σ2 minw2 t,min · KL(w⋆, wt) ....

work page 2021
[25]

dual" perspective) and Proposition G.13 (proven using a slightly more involved “primal

Then combin- ing (128) and (129), we have p⋆(imin) log 1 pt,min ≤ cΛ =⇒ log 1 pt,min ≤ cΛ p⋆(imin) ≤ cΛ δp ≤ cΛ δ , where the final two inequalities are due to p⋆(imin) ≥ δp ≥ δ. Rearranging, we find pt,min ≥ exp −cΛ δ . (130) By an identical calculation, we also have qt,min ≥ exp( −cΛ δ ). As wt,min = min{pt,min, qt,min}, it follows that wt,min ≥ exp( −c...

work page 2021
[26]

Proposition I.5 extends this result to hold for general δp, δq ∈ (0, 1). Proof. First, let σmin,n and σmin,m be the component-wise restricted singular values from (110), and recall from Proposition G.11 that σmin = min{σmin(A, 1⊥), σmin(A⊤, 1⊥)}. Using calculations similar to the proof 65 of Part (2) of Proposition G.16, we will show that σmin(A, 1⊥) = σm...

work page
[27]

squished

For this, recall by definition that σmin(A, 1⊥) = infv∈1⊥\{0} ∥Π1⊥ (Av)∥2 ∥v∥2 . Moreover, for any v ∈ 1⊥ ⊂ R2, it follows that v = c · (−1, 1) for some c ∈ R. Then using the definition of A = Aδp,δq, and recalling that Π1⊥ = I − 1 2 11⊥ ∈ R2×2, it follows by a direct calculation that Π1⊥ (Av) = c · δp − (δp − (1 − δp))/2 −(1 − δp) − (δp − (1 − δp))/2 = c...

work page 2024
[28]

Let p = ( p(1), p(2)) and q = (q(1), q(2))

Exact Characterization of Variance in 2 × 2 Setting. Let p = ( p(1), p(2)) and q = (q(1), q(2)). We first prove that Varp(Aq) ≤ 2 · p(1) · p(2) · ∥A∥2 2 · ∥q − q⋆∥2 2 (146) and Var q(A⊤ p) ≤ 2 · q(1) · q(2) · ∥A∥2 2 · ∥p − p⋆∥2 2 . (147) To prove (146), let v = Aq ∈ R2. By definition of Varp(v) from (35), it is straightforward to compute in the two-dimens...

work page
[29]

Relating KL to Euclidean Distance to Nash. Next, we prove the following relationships between KL and euclidean distance to Nash: ∥p − p⋆∥2 2 ≤ 4 · max(pmin, δp) · KL(p⋆, p) (148) and ∥q − q⋆∥2 2 ≤ 4 · max(qmin, δq) · KL(q⋆, q) . (149) We start by proving (148). For this, recall from Section 3 that for w = ( p, q) ∈ W , we write R(w) = Rm(q) + Rn(q) to den...

work page 2006
[30]

Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp)

In particular, this also implies that ps(2) = min(ps(1), ps(2)) ≤ 1 2 for all s ∈ [0, 1]. Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp) . Substituting this uniform bound into (151), and using the fact that R 1 0 (1 − s)ds = 1 2, we find KL(p⋆, p) ≥ 1 4 · ∥p − p⋆∥2 2 · 1 max(pmin, δp) . Rearranging then ...

work page
[31]

Combining the Pieces. Combining the inequalities of (146) and (147) from Step 1 and expressions (148) and (149) from Step 2, we find Varp(Aq) ≤ 8 · ∥A∥2 2 · pmin · max(qmin, δ) · KL(q⋆, q) and Var q(A⊤ p) ≤ 8 · ∥A∥2 2 · qmin · max(pmin, δ) · KL(p⋆, p) . (152) Here, we additionally used the fact that p(1)p(2) = ( 1 − p(2))p(2) ≤ p(2) = pmin and similarly q...

work page 2025
[32]

base matrix

Otherwise, if δp, δq ≥ 1 10, then the universal last-iterate bound of Theorem 4.7, together with Corollary B.6, would directly give KL(w⋆, wT) ≤ O(T−1/2) for all T ≥ 1. 82 Assumptions on Initialization. We assume that the initialization w0 = ( p0, q0) ∈ ri(W ) is the uniform distribution p0 = (1/2, 1/2) and q0 = (1/2, 1/2). Considering Canonical 2 × 2 Mat...

work page 2025
[33]

Let {wt} and {zt} be the primal and dual iterates of OMWU for η satisfying Assumption 2 and initialized from the uniform distribution

Let A := Aδp,δq ∈ [−1, 1]2×2 be the matrix from Definition I.1. Let {wt} and {zt} be the primal and dual iterates of OMWU for η satisfying Assumption 2 and initialized from the uniform distribution. For each t ≥ 0, let zt = ( xt, yt) and let wt = ( pt, qt). Then for all t ≥ 0, the following bounds hold: (i) F (zt) < F(zt−1) < · · · < F(z1) ≤ 5 4 · F(z0) ≤...

work page
[34]

By definition of Tstartup, Property (b) of Lemma L.10 implies that such a bound on pt,min and qt,min must hold for at least K/η iterations

Then by the universal one-step multiplicative change in KL divergence from Theorem 4.7 (and in particular, using the expression from Corollary H.3, recall that for every such t where pt,min ≥ δp/3 and qt,min ≥ δq/3, we have KL(w⋆, wt+1) ≤ KL(w⋆, wt) · exp − 1 20 η2σ2 min · pt,min · qt,min ≤ KL(w⋆, wt) · exp − 1 720 η2δpδq . By definition of Tstartup, Prop...

work page

[1] [1]

For this, fix w ∈ ri(W )

Energy gradient is globally surjective: First, we show that ∇F : Rm+n → ri(W ) is surjective over the full domain Rm+n. For this, fix w ∈ ri(W ). Recalling that R is the joint negative entropy regularizer, observe that Proposition D.4 shows that∇R(w) = v for some v ∈ Rm+n. Using the definition of∇R, it follows by Proposition D.5 that∇F(v) = ∇F(∇R(w)) = w....

work page

[2] [2]

Further recall from Proposition D.3 that ∇F is invariant to constant shifts, meaning for any s ∈ S that ∇F(v + s) = ∇F(v) = w

Energy gradient is invariant to constant shifts: Now, again fix w ∈ ri(W ), and let v = ∇R(w) ∈ Rm+n. Further recall from Proposition D.3 that ∇F is invariant to constant shifts, meaning for any s ∈ S that ∇F(v + s) = ∇F(v) = w. Thus to prove ∇F is surjective over Z, it is sufficient to establish v + s ∈ Z for some s ∈ S . As v ∈ Rm+n can lie in the full ...

work page

[3] [3]

For this, first observe that the property Z + S = Rm+n is equivalent to establishing equality between the orthogonal complements of the two sets

Sufficient property holds under a unique and interior NE: 27 We will prove that Z + S = Rm+n holds under the assumption of a unique and interior NE w⋆ = ( p⋆, q⋆) ∈ ri(W ). For this, first observe that the property Z + S = Rm+n is equivalent to establishing equality between the orthogonal complements of the two sets. Since Z, S are linear subspaces, we ha...

work page

[4] [4]

Thus over the linear subspace Z ⊂ Rm+n, the first-order optimality conditions for F give (see, e.g., Boyd and Vandenberghe (2004), Sec

Existence of minimizer in effective dual space: Recall that F is convex and continuously differentiable. Thus over the linear subspace Z ⊂ Rm+n, the first-order optimality conditions for F give (see, e.g., Boyd and Vandenberghe (2004), Sec. 4.2.3): z ∈ argminz′∈Z F(z′) ⇐ ⇒ ⟨∇ F(z), z′′⟩ = 0 for all z′′ ∈ Z . Now due to Proposition D.9, ∇F : Z → ri(W ) is ...

work page 2004

[5] [5]

Thus fixing z⋆ ∈ argminz∈Z F(z) from Step (1), then also zτ = z⋆ + τJw ⋆ ∈ argminz∈Z F(z) for any τ ∈ R

Minimum energy function value attained: Recall from expression (32) in the proof of Proposition D.10 that Z ∩ S = Span(Jw ⋆). Thus fixing z⋆ ∈ argminz∈Z F(z) from Step (1), then also zτ = z⋆ + τJw ⋆ ∈ argminz∈Z F(z) for any τ ∈ R. This is because ∇F(z⋆ + s) = ∇F(z⋆) for any s ∈ S , and since Jw ⋆ ∈ S . Thus z′ ∈ Z is in the argmin set if and only if z′ = ...

work page

[6] [6]

Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆

Apply Fenchel-Young identity: Now by Part (ii) of the Fenchel-Young identity from Proposition D.2, we have for all z ∈ Z and w = ∇F(z) that KL(w⋆, w) = R(w⋆) + F(z) − ⟨w⋆, z⟩ = R(w⋆) + F(z) , (33) where the second equality is due to Proposition D.6. Now let z⋆ ∈ Z such that ∇F(z⋆) = w⋆. Then it follows from (33) that 0 = KL(w⋆, w⋆) = R(w⋆) + F(z⋆) = ⇒ R(w...

work page 2004

[7] [7]

Exact expansion of change in energy: The starting point is to take a first-order Taylor expansion of F(zt+1) around zt. Using the structure of the (OMWU Dual) update, as well as the Hessian-based approximation of gradient differences from Proposition E.4, we derive an exact expansion of ∆F(zt) = F(zt+1) − F(zt), the one-step change in energy over the dual...

work page

[8] [8]

Controlling the error terms: Notice that the first term of (53) is exactly the desired dissipation term −η2∥J∇F(zt)∥2 zt (up to a constant factor) appearing in the statement of the lemma. Thus, the remaining technical challenge is to ensure the final three terms of (53) also scale at most like O(η2∥J∇F(zt)∥2 zt ), with a constant factor that can be made l...

work page

[9] [9]

Thus for all such L ≥ 72, it holds that 18 25 + 36 exp(6/L) 25L ≤ 149

work page

[10] [10]

Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ

■ For the inner product involving GF(zt, zt−1), we derive a similar bound: Proposition F.8. Let {zt} be the iterates of (OMWU Dual) with η ≤ 1 21σ . Then for t ≥ 1: η⟨J∇F(zt), GF(zt, zt−1)⟩ ≤ 18 5 · ν(6ησ) · η2∥J∇F(zt)∥2 zt . Moreover, if η ≤ 1 Lσ for L ≥ 72, then 18 25 · ν(6ησ) ≤ 54 5L + 108 exp(6/L) 5L2 ≤ 31 200. Proof. Applying the generalized Cauchy-S...

work page

[11] [11]

Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2

□ Using this new notation and relationships between constants, we now give the overall template for proving claim (i) of the proposition: General template for proving (i). Our goal in proving (i) is to establish that ∥∆t−1∥zt ≤ C · ∥∆t∥zt (83) for C = 2. For this, observe by the triangle inequality that ∥∆t−1∥zt = ∥∆t − (∆t − ∆t−1)∥zt ≤ ∥ ∆t∥zt + ∥∆t − ∆t...

work page

[12] [12]

Non-uniform restricted strong convexity. Using the definition of the minimum restricted eigenvalue M(z) from Definition G.1, the inequality of Proposition G.2, and the characterization of M(z) from Proposition G.3, we have ∥J∇F(z)∥2 z ≥ M(z) · ∥ΠS ⊥ J∇F(z) ∥2 2 ≥ wmin · ∥ΠS ⊥ J∇F(z) ∥2 2

work page

[13] [13]

Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9)

Invariance to Nash and restricted spectrum of J. Let z⋆ ∈ Z such that ∇F(z⋆) = w⋆ (recall that such a z⋆ exists due to Proposition D.9). Now by Proposi- tion B.1, we have J∇F(z⋆) = Jw ⋆ ∈ S . Thus by definition of the projectionΠS ⊥, we have ΠS ⊥ (J∇F(z⋆)) = 0, and therefore by linearity ΠS ⊥ (J∇F(z)) = ΠS ⊥ (J(∇F(z) − ∇F(z⋆))) . Moreover, as w, w⋆ ∈ W , ...

work page

[14] [14]

We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆)

Bounding the dual suboptimality gap. We now derive the following lower bound on ∥∇F(z) − ∇F(z⋆)∥2 2 in terms of the dual gap F(z) − F(z⋆). We proceed in two steps. First, we have from Lemma G.10 the dual relationship between KL(w⋆, w) and χ2(w⋆, w), which gives F(z) − F(z⋆) ≤ ∥∇ F(z) − ∇F(z⋆)∥2 z,∗ . Next, by the restricted local smoothness properties of ...

work page

[15] [15]

Combining (106) from Step 2 and (107) from Step 3 yields ∥J∇F(z)∥2 z ≥ σ2 min · w2 min · (F(z) − F(z⋆))

Equivalence between primal and dual suboptimality gaps. Combining (106) from Step 2 and (107) from Step 3 yields ∥J∇F(z)∥2 z ≥ σ2 min · w2 min · (F(z) − F(z⋆)) . (108) This is exactly a non-uniform skew-gradient-domination property for the energy function F. To relate ∥J∇F(z)∥2 z back to the primal space, we apply the equivalence of Proposition D.11 (also...

work page

[16] [16]

Proof of part (iii)

Following identical calculations, we similarly conclude Varq(A⊤ p) ≥ σ2 min,m · qmin · ∥p − p⋆∥2 2. Proof of part (iii). Using the relationship KL(p⋆, p) ≤ χ2(p⋆, p), we can write and further bound KL(p⋆, p) ≤ χ2(p⋆, p) = ∑ i∈[m] (p(i) − p⋆(i))2 p(i) ≤ 1 pmin · ∑ i∈[m] (p(i) − p⋆(i))2 = 1 pmin · ∥p − p⋆∥2 2 . Rearranging yields ∥p − p⋆∥2 2 ≥ pmin · KL(p⋆,...

work page

[17] [17]

Oranization of section

■ H Details on Universal Last-Iterate Convergence in KL This section gives the proofs of Theorem 4.2 and Theorem 4.7, which establish asymptotic last-iterate convergence and a linear last-iterate convergence rate in KL divergence, respectively. Oranization of section. The section is organized as follows: • Section H.1 first proves a helper lemma that esta...

work page 2021

[18] [18]

■ 57 H.2 Proof of Theorem 4.2 – Asymptotic Last-Iterate Convergence We now give a proof of asymptotic last-iterate convergence to a unique and interior NE

Then substituting (120) into (119) and rearranging further gives KL(w⋆, w1) ≤ KL(w⋆, w0) · 1 + 1 4 , which completes the proof. ■ 57 H.2 Proof of Theorem 4.2 – Asymptotic Last-Iterate Convergence We now give a proof of asymptotic last-iterate convergence to a unique and interior NE. We first restate the theorem: Theorem 4.2 (Asymptotic Last-Iterate Conver...

work page

[19] [19]

Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0)

Primal iterates lie in compact sublevelset: Combining Corollary D.7 and the upper bound of Lemma 4.1, we have for all t ≥ 1: KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) ≤ − 1 20 η2∥J∇F(zt)∥2 zt ≤ 0 . Moreover, by Lemma H.1, we haveKL(w⋆, w1) ≤ (5/4) · KL(w⋆, w0). Together, this means thatKL(w⋆, wt) ≤ (5/4) · KL(w⋆, w0) for all t ≥ 0. Letting κ = (5/4) · K...

work page

[20] [20]

Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p)

Dissipation term is zero only at Nash: For z ∈ Rm+n, let w = ( p, q) = ∇F(z) ∈ ri(W ). Observe by Part (i) of Proposition G.13 that ∥J∇F(z)∥2 z = Varp(Aq) + Varq(A⊤ p). By definition of variance (see expression (35)), note that Varp(v) = 0 and Varq(u) = 0 if and only if v = c1m and u = d1n for some c, d ∈ R. Together, this implies that ∥J∇F(z)∥2 z = 0 if ...

work page

[21] [21]

Moreover, by the continuity ofD(·), if wtk → w∞ for some subsequence {wtk }, then also D(wtk ) → D(w∞)

Subsequence and global convergence: As all wt ∈ U for the compact set U, we have by the Bolzano-Weierstrass theorem that every infinite subsequence of {wt} has at least one limit pointw∞ ∈ U . Moreover, by the continuity ofD(·), if wtk → w∞ for some subsequence {wtk }, then also D(wtk ) → D(w∞). Now as U is compact, (121) implies that KL(w⋆, wtk ) converg...

work page

[22] [22]

Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt)

Change in KL is change in energy: First, let {zt} denote the dual OMWU iterates. Recall from Corollary D.7 that for all t ≥ 0, we have KL(w⋆, wt+1) − KL(w⋆, wt) = F(zt+1) − F(zt) . (122)

work page

[23] [23]

Upper bound on energy dissipation: From Lemma 4.1, we have for all t ≥ 1 under the constraint on η that F(zt+1) − F(zt) ≤ − 1 20 · η2∥J∇F(zt)∥2 zt . (123)

work page

[24] [24]

Combining expressions (122), (123), and (124), we find for all t ≥ 1 KL(w⋆, wt+1) − KL(w⋆, wt) ≤ − 1 20 η2 · σ2 minw2 t,min · KL(w⋆, wt)

Non-uniform skew-gradient domination: By Proposition 4.4, we further have the lower bound ∥J∇F(zt)∥2 zt ≥ σ2 min · w2 t,min · KL(w⋆, wt) , (124) where σmin > 0 under the unique and interior Nash equilibrium assumption. Combining expressions (122), (123), and (124), we find for all t ≥ 1 KL(w⋆, wt+1) − KL(w⋆, wt) ≤ − 1 20 η2 · σ2 minw2 t,min · KL(w⋆, wt) ....

work page 2021

[25] [25]

dual" perspective) and Proposition G.13 (proven using a slightly more involved “primal

Then combin- ing (128) and (129), we have p⋆(imin) log 1 pt,min ≤ cΛ =⇒ log 1 pt,min ≤ cΛ p⋆(imin) ≤ cΛ δp ≤ cΛ δ , where the final two inequalities are due to p⋆(imin) ≥ δp ≥ δ. Rearranging, we find pt,min ≥ exp −cΛ δ . (130) By an identical calculation, we also have qt,min ≥ exp( −cΛ δ ). As wt,min = min{pt,min, qt,min}, it follows that wt,min ≥ exp( −c...

work page 2021

[26] [26]

Proposition I.5 extends this result to hold for general δp, δq ∈ (0, 1). Proof. First, let σmin,n and σmin,m be the component-wise restricted singular values from (110), and recall from Proposition G.11 that σmin = min{σmin(A, 1⊥), σmin(A⊤, 1⊥)}. Using calculations similar to the proof 65 of Part (2) of Proposition G.16, we will show that σmin(A, 1⊥) = σm...

work page

[27] [27]

squished

For this, recall by definition that σmin(A, 1⊥) = infv∈1⊥\{0} ∥Π1⊥ (Av)∥2 ∥v∥2 . Moreover, for any v ∈ 1⊥ ⊂ R2, it follows that v = c · (−1, 1) for some c ∈ R. Then using the definition of A = Aδp,δq, and recalling that Π1⊥ = I − 1 2 11⊥ ∈ R2×2, it follows by a direct calculation that Π1⊥ (Av) = c · δp − (δp − (1 − δp))/2 −(1 − δp) − (δp − (1 − δp))/2 = c...

work page 2024

[28] [28]

Let p = ( p(1), p(2)) and q = (q(1), q(2))

Exact Characterization of Variance in 2 × 2 Setting. Let p = ( p(1), p(2)) and q = (q(1), q(2)). We first prove that Varp(Aq) ≤ 2 · p(1) · p(2) · ∥A∥2 2 · ∥q − q⋆∥2 2 (146) and Var q(A⊤ p) ≤ 2 · q(1) · q(2) · ∥A∥2 2 · ∥p − p⋆∥2 2 . (147) To prove (146), let v = Aq ∈ R2. By definition of Varp(v) from (35), it is straightforward to compute in the two-dimens...

work page

[29] [29]

Relating KL to Euclidean Distance to Nash. Next, we prove the following relationships between KL and euclidean distance to Nash: ∥p − p⋆∥2 2 ≤ 4 · max(pmin, δp) · KL(p⋆, p) (148) and ∥q − q⋆∥2 2 ≤ 4 · max(qmin, δq) · KL(q⋆, q) . (149) We start by proving (148). For this, recall from Section 3 that for w = ( p, q) ∈ W , we write R(w) = Rm(q) + Rn(q) to den...

work page 2006

[30] [30]

Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp)

In particular, this also implies that ps(2) = min(ps(1), ps(2)) ≤ 1 2 for all s ∈ [0, 1]. Then by concavity, for all s ∈ [0, 1]: ps(2) · (1 − ps(2)) ≤ ps(2) ≤ max(p(2), δp) = max(pmin, δp) . Substituting this uniform bound into (151), and using the fact that R 1 0 (1 − s)ds = 1 2, we find KL(p⋆, p) ≥ 1 4 · ∥p − p⋆∥2 2 · 1 max(pmin, δp) . Rearranging then ...

work page

[31] [31]

Combining the Pieces. Combining the inequalities of (146) and (147) from Step 1 and expressions (148) and (149) from Step 2, we find Varp(Aq) ≤ 8 · ∥A∥2 2 · pmin · max(qmin, δ) · KL(q⋆, q) and Var q(A⊤ p) ≤ 8 · ∥A∥2 2 · qmin · max(pmin, δ) · KL(p⋆, p) . (152) Here, we additionally used the fact that p(1)p(2) = ( 1 − p(2))p(2) ≤ p(2) = pmin and similarly q...

work page 2025

[32] [32]

base matrix

Otherwise, if δp, δq ≥ 1 10, then the universal last-iterate bound of Theorem 4.7, together with Corollary B.6, would directly give KL(w⋆, wT) ≤ O(T−1/2) for all T ≥ 1. 82 Assumptions on Initialization. We assume that the initialization w0 = ( p0, q0) ∈ ri(W ) is the uniform distribution p0 = (1/2, 1/2) and q0 = (1/2, 1/2). Considering Canonical 2 × 2 Mat...

work page 2025

[33] [33]

Let {wt} and {zt} be the primal and dual iterates of OMWU for η satisfying Assumption 2 and initialized from the uniform distribution

Let A := Aδp,δq ∈ [−1, 1]2×2 be the matrix from Definition I.1. Let {wt} and {zt} be the primal and dual iterates of OMWU for η satisfying Assumption 2 and initialized from the uniform distribution. For each t ≥ 0, let zt = ( xt, yt) and let wt = ( pt, qt). Then for all t ≥ 0, the following bounds hold: (i) F (zt) < F(zt−1) < · · · < F(z1) ≤ 5 4 · F(z0) ≤...

work page

[34] [34]

By definition of Tstartup, Property (b) of Lemma L.10 implies that such a bound on pt,min and qt,min must hold for at least K/η iterations

Then by the universal one-step multiplicative change in KL divergence from Theorem 4.7 (and in particular, using the expression from Corollary H.3, recall that for every such t where pt,min ≥ δp/3 and qt,min ≥ δq/3, we have KL(w⋆, wt+1) ≤ KL(w⋆, wt) · exp − 1 20 η2σ2 min · pt,min · qt,min ≤ KL(w⋆, wt) · exp − 1 720 η2δpδq . By definition of Tstartup, Prop...

work page