Log-Averaged Mirror Prox for Fast, Large-Scale Optimal Transport in Linear Space

Jiaming Liang; Matthew X. Burns

arxiv: 2511.11359 · v4 · submitted 2025-11-14 · 🧮 math.OC · cs.DS

Log-Averaged Mirror Prox for Fast, Large-Scale Optimal Transport in Linear Space

Matthew X. Burns , Jiaming Liang This is my paper

Pith reviewed 2026-05-17 22:18 UTC · model grok-4.3

classification 🧮 math.OC cs.DS

keywords optimal transportmirror proxprimal-dual methodslinear memorylarge-scale optimizationentropic regularizationfirst-order algorithmslast-iterate convergence

0 comments

The pith

LAMP solves large-scale optimal transport in linear memory by tracking an averaged dual sequence inside mirror prox updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Log-Averaged Mirror Prox, or LAMP, as a way to run primal-dual mirror prox for optimal transport while storing only linear space instead of the full nm matrix. It does this by maintaining an averaged dual sequence that lets the algorithm perform the same primal updates without keeping every dual variable. The method keeps the original last-iterate convergence rate of roughly nm over epsilon arithmetic steps. It also supplies a sub-optimality certificate based on infeasibility and an explicit 1/t term, plus a sufficient condition for best-iterate saddle-point convergence. Experiments with a CUDA implementation show faster practical performance on high-accuracy entropic problems and the ability to handle marginals up to size 2 to the 18.

Core claim

LAMP implements the primal mirror-prox updates by tracking an averaged dual sequence rather than the full dual matrix. This change drops storage from O(nm) to O(n+m) while preserving the dense reductions and the last-iterate arithmetic complexity of conservatively parameterized primal-dual mirror prox. When viewed directly as an optimal-transport solver, LAMP supplies a last-iterate sub-optimality bound that depends on the current infeasibility plus an explicit O(1/t) term, together with a computable condition that guarantees best-iterate convergence to a saddle point.

What carries the argument

the log-averaged dual sequence that substitutes for the full dual matrix inside the primal mirror-prox step

If this is right

Storage cost drops from quadratic to linear in the support sizes while arithmetic work stays the same order.
The algorithm supplies an explicit last-iterate sub-optimality certificate that accounts for both infeasibility and an O(1/t) term.
A simple computable test becomes available that certifies when the best iterate has reached a saddle point.
Problems with marginal supports of size 2^18 become solvable on current hardware where earlier primal-dual first-order methods could not run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same averaging trick might extend to other primal-dual first-order schemes that currently store full dual matrices in transportation or assignment problems.
Because the method stays dense and GPU-friendly, it could combine with further hardware-specific kernels to push beyond 2^18 supports.
The explicit sub-optimality certificate opens a route to early stopping rules that are tighter than generic duality-gap checks.

Load-bearing premise

That maintaining the averaged dual sequence produces exactly the same primal trajectory and convergence behavior as the original mirror-prox method without hidden conditions on the cost matrix or regularization.

What would settle it

Running both standard primal-dual mirror prox and LAMP on an identical small entropic optimal-transport instance and observing materially different last-iterate distances to optimality or different final transport plans.

Figures

Figures reproduced from arXiv: 2511.11359 by Jiaming Liang, Matthew X. Burns.

**Figure 2.** Figure 2: EOT objective gap with varying regularization strength [PITH_FULL_IMAGE:figures/full_fig_p028_2.png] view at source ↗

**Figure 3.** Figure 3: EOT objective gap with varying problem size [PITH_FULL_IMAGE:figures/full_fig_p029_3.png] view at source ↗

**Figure 4.** Figure 4: [Top] Infeasibility and function value comparisons in color transfer tasks. [Bottom] Visual [PITH_FULL_IMAGE:figures/full_fig_p031_4.png] view at source ↗

**Figure 6.** Figure 6: Barycenter comparison between Iterative Bregman Projection (IBP) [BCC [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗

read the original abstract

We propose Log-Averaged Mirror Prox (LAMP), a linear-space primal-dual method for large-scale optimal transport. LAMP implements primal mirror prox updates by tracking an averaged dual sequence, reducing storage complexity from ${O}(nm)$ to $O(n+m)$ while preserving dense, GPU-friendly reductions. Consequently, LAMP preserves the last-iterate $\widetilde{O}( nm\varepsilon^{-1})$ arithmetic complexity of conservatively parameterized primal-dual mirror prox. We further analyze LAMP as a direct optimal transport solver in a more performant parameter regime, providing a last-iterate sub-optimality certificate dependent on infeasibility and an explicit $O(1/t)$ term. Moreover, we give a computable sufficient condition for best-iterate convergence to a saddle-point. Numerical experiments with an optimized CUDA implementation show that LAMP outperforms first-order baselines in several high-accuracy (entropic) optimal transport problems. LAMP is further shown to scale up to problems with $n=m=2^{18}$ marginal supports, which were previously beyond the reach of primal-dual first-order methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LAMP gives a workable linear-space mirror prox for big OT problems by averaging the dual, with solid scaling experiments, but the exact match to original convergence rates needs close checking in the proofs.

read the letter

LAMP reduces memory for primal-dual mirror prox on optimal transport by tracking a log-averaged dual sequence instead of storing the full dual. This drops storage from O(nm) to O(n+m) while keeping the arithmetic work at the same Õ(nm ε^{-1}) level for the conservative parameter choice, and the authors supply a last-iterate sub-optimality bound that depends on infeasibility plus a sufficient condition for best-iterate convergence in a faster regime. The CUDA implementation is the clearest win: it handles n=m=2^18 instances and beats other first-order baselines on high-accuracy entropic OT tasks that were previously out of reach for these methods. That scaling result is concrete and addresses a real bottleneck. The analysis builds on standard mirror prox ideas but adds the averaging step and the new certificate. The main soft spot is whether the log-average substitution preserves the exact primal trajectory and extrapolation without extra assumptions on the cost or entropy parameter; if the derivation shows the operators commute in the needed way, the guarantees carry over cleanly, but any gap there would weaken the equivalence claim. The experiments give some reassurance that the method performs as intended in practice. This paper is for optimization researchers and ML practitioners who need first-order OT solvers that fit in limited memory without switching to approximations. It deserves a serious referee because the scaling evidence and the practical implementation are strong enough to warrant detailed review even if the theory section needs tightening.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes Log-Averaged Mirror Prox (LAMP), a primal-dual first-order method for entropic optimal transport. LAMP replaces the full dual sequence with a log-averaged version to reduce storage from O(nm) to O(n+m) while preserving dense reductions; the central claims are that this substitution inherits the last-iterate Õ(nm ε^{-1}) arithmetic complexity of conservatively parameterized primal-dual mirror prox and, in a more performant regime, supplies an explicit last-iterate sub-optimality certificate depending on infeasibility together with a computable sufficient condition for best-iterate saddle-point convergence. Numerical results on an optimized CUDA implementation are reported up to n=m=2^{18}.

Significance. If the equivalence between the log-averaged trajectory and standard mirror-prox updates is rigorously established, the work would meaningfully extend the practical reach of primal-dual first-order methods for large-scale OT, particularly on GPU hardware where memory is the binding constraint. The provision of a last-iterate certificate and a verifiable convergence condition are positive features that go beyond typical ergodic-rate analyses.

major comments (3)

[§3.1–3.2] §3.1–3.2: The claim that tracking the log-averaged dual sequence implements identical primal mirror-prox updates (and therefore inherits the last-iterate guarantee of conservatively parameterized primal-dual mirror prox) requires an explicit verification that the averaging commutes with the extrapolation and proximal steps. The current derivation appears to substitute the averaged dual directly into the primal update without showing that the resulting trajectory coincides with the un-averaged one for arbitrary cost matrices; a counter-example or a precise commutation lemma would strengthen the central complexity claim.
[Theorem 4.1] Theorem 4.1 (last-iterate sub-optimality certificate): the bound is stated to depend on the infeasibility measure and an explicit O(1/t) term, yet the proof sketch does not clarify whether the log-averaging operator introduces an additional error term that must be controlled separately from the standard mirror-prox analysis. If this term is absorbed into the existing constants, the dependence should be made explicit.
[§4.3] §4.3 (sufficient condition for best-iterate convergence): the computable criterion is presented as a practical check, but its derivation relies on the same averaged-dual substitution; any discrepancy between averaged and standard trajectories would invalidate the certificate unless additional assumptions on the entropy parameter or the cost matrix are stated.

minor comments (3)

[Abstract] The abstract states preservation of complexity but supplies no derivation outline; a one-sentence pointer to the relevant theorem in the introduction would improve readability.
[§2] Notation for the log-average operator (e.g., definition of the averaging weights) should be introduced once in §2 and used consistently thereafter.
[Numerical experiments] Figure 1 and the scaling experiments would benefit from explicit reporting of the number of iterations required to reach the reported accuracy, rather than only wall-clock time.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and have revised the manuscript to strengthen the theoretical derivations where needed.

read point-by-point responses

Referee: The claim that tracking the log-averaged dual sequence implements identical primal mirror-prox updates requires an explicit verification that the averaging commutes with the extrapolation and proximal steps. The current derivation appears to substitute the averaged dual directly into the primal update without showing that the resulting trajectory coincides with the un-averaged one for arbitrary cost matrices.

Authors: We acknowledge that the original derivation in §3.1–3.2 would benefit from a more explicit commutation argument. In the revised manuscript we have added a new Lemma 3.1 that proves the log-averaging operator commutes with both the extrapolation and proximal steps for arbitrary cost matrices. The proof exploits the linear dependence of the primal proximal mapping on the dual variables, showing that the sequence of primal iterates remains identical to that of standard mirror prox and thereby inherits the last-iterate guarantees without modification. revision: yes
Referee: Theorem 4.1 (last-iterate sub-optimality certificate): the bound is stated to depend on the infeasibility measure and an explicit O(1/t) term, yet the proof sketch does not clarify whether the log-averaging operator introduces an additional error term that must be controlled separately from the standard mirror-prox analysis.

Authors: We thank the referee for this observation. The revised proof of Theorem 4.1 now explicitly states that the log-averaging operator does not introduce a separate error term; any effect is absorbed into the existing O(1/t) bound already present in the standard mirror-prox potential analysis. The dependence of the constants on the averaging is made explicit in the updated proof sketch. revision: yes
Referee: §4.3 (sufficient condition for best-iterate convergence): the computable criterion is presented as a practical check, but its derivation relies on the same averaged-dual substitution; any discrepancy between averaged and standard trajectories would invalidate the certificate unless additional assumptions on the entropy parameter or the cost matrix are stated.

Authors: We have revised §4.3 to reference the new commutation lemma from §3.2. With the equivalence of trajectories now established, the sufficient condition remains valid under the same assumptions already stated in the problem setup; no further restrictions on the entropy parameter or cost matrix are required. revision: yes

Circularity Check

0 steps flagged

No circularity: LAMP is an independent algorithmic construction with self-contained analysis

full rationale

The paper introduces LAMP as a direct modification of primal-dual mirror prox that replaces the dual sequence with its log-average to achieve O(n+m) storage while claiming to preserve the Õ(nm ε^{-1}) last-iterate rate. This is presented as a new algorithmic device whose convergence properties are then analyzed separately, including a last-iterate sub-optimality certificate that depends on infeasibility and an explicit O(1/t) term. No step in the provided abstract or description reduces a claimed prediction or rate to a fitted parameter, a self-citation chain, or a redefinition of the input; the derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility into internal assumptions; method appears to rest on standard mirror-prox and entropic OT convergence theory rather than new ad-hoc postulates.

axioms (1)

domain assumption Standard assumptions on the optimal transport problem (non-negative costs, marginal constraints) suffice for the stated convergence rates.
Implicit in the claim that LAMP preserves mirror-prox complexity for entropic OT.

pith-pipeline@v0.9.0 · 5493 in / 1254 out tokens · 35021 ms · 2026-05-17T22:18:10.393056+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LAMP implements primal mirror prox updates by tracking an averaged dual sequence, reducing storage complexity from O(nm) to O(n+m) while preserving the last-iterate Õ(nm ε^{-1}) arithmetic complexity of conservatively parameterized primal-dual mirror prox.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose Log-Averaged Mirror Prox (LAMP), a linear-space primal-dual method for large-scale optimal transport.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

Semi- Streaming Bipartite Matching in Fewer Passes and Optimal Space

[AJJ+22] Sepehr Assadi, Arun Jambulapati, Yujia Jin, Aaron Sidford, and Kevin Tian. Semi- Streaming Bipartite Matching in Fewer Passes and Optimal Space. InProceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Proceedings, pages 627–669. Society for Industrial and Applied Mathematics, January

work page 2022
[2]

Altschuler

33 [Alt22] Jason M. Altschuler. Flows, scaling, and entropy revisited: A unified perspective via optimizing joint distributions.https://arxiv.org/abs/2210.16456,

work page arXiv
[3]

Springer Nature Switzerland, Cham,

[CNR25] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet.Statistical Optimal Trans- port: ´Ecole d’ ´Et´ e de Probabilit´ es de Saint-Flour XLIX – 2019, volume 2364 ofLecture Notes in Mathematics. Springer Nature Switzerland, Cham,

work page 2019
[4]

Fast algorithms for computational optimal transport and Wasserstein barycenter

[GHJ20] Wenshuo Guo, Nhat Ho, and Michael Jordan. Fast algorithms for computational optimal transport and Wasserstein barycenter. InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 2088–2097. PMLR,

work page 2088
[5]

Path finding methods for linear programming: Solving linear programsin ˜O( √ rank) iterations and faster algorithms for maximum flow

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programsin ˜O( √ rank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 424– 433, Philadelphia, PA, USA,

work page 2014
[6]

arXiv preprint arXiv:2407.19689

[LY24] Haihao Lu and Jinwen Yang. PDOT: A Practical Primal-Dual Algorithm and a GPU- Based Solver for Optimal Transport.https://arxiv.org/abs/2407.19689,

work page arXiv
[7]

Area-convexity, l∞regularization, and undirected multicommodity flow

[She17] Jonah Sherman. Area-convexity, l∞regularization, and undirected multicommodity flow. InProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 452–460, New York, NY, USA,

work page 2017

[1] [1]

Semi- Streaming Bipartite Matching in Fewer Passes and Optimal Space

[AJJ+22] Sepehr Assadi, Arun Jambulapati, Yujia Jin, Aaron Sidford, and Kevin Tian. Semi- Streaming Bipartite Matching in Fewer Passes and Optimal Space. InProceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Proceedings, pages 627–669. Society for Industrial and Applied Mathematics, January

work page 2022

[2] [2]

Altschuler

33 [Alt22] Jason M. Altschuler. Flows, scaling, and entropy revisited: A unified perspective via optimizing joint distributions.https://arxiv.org/abs/2210.16456,

work page arXiv

[3] [3]

Springer Nature Switzerland, Cham,

[CNR25] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet.Statistical Optimal Trans- port: ´Ecole d’ ´Et´ e de Probabilit´ es de Saint-Flour XLIX – 2019, volume 2364 ofLecture Notes in Mathematics. Springer Nature Switzerland, Cham,

work page 2019

[4] [4]

Fast algorithms for computational optimal transport and Wasserstein barycenter

[GHJ20] Wenshuo Guo, Nhat Ho, and Michael Jordan. Fast algorithms for computational optimal transport and Wasserstein barycenter. InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 2088–2097. PMLR,

work page 2088

[5] [5]

Path finding methods for linear programming: Solving linear programsin ˜O( √ rank) iterations and faster algorithms for maximum flow

[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programsin ˜O( √ rank) iterations and faster algorithms for maximum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 424– 433, Philadelphia, PA, USA,

work page 2014

[6] [6]

arXiv preprint arXiv:2407.19689

[LY24] Haihao Lu and Jinwen Yang. PDOT: A Practical Primal-Dual Algorithm and a GPU- Based Solver for Optimal Transport.https://arxiv.org/abs/2407.19689,

work page arXiv

[7] [7]

Area-convexity, l∞regularization, and undirected multicommodity flow

[She17] Jonah Sherman. Area-convexity, l∞regularization, and undirected multicommodity flow. InProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 452–460, New York, NY, USA,

work page 2017