pith. machine review for the scientific record.

arxiv: 2604.09276 · v1 · submitted 2026-04-10 · 💻 cs.LG

Recognition: unknown

Distributed Online Convex Optimization with Compressed Communication: Optimal Regret and Applications

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords distributed online convex optimization · compressed communication · regret bounds · error feedback · follow-the-regularized-leader · stochastic optimization · online learning

The pith

Distributed online convex optimization achieves order-optimal regret even under compressed communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies how compressing messages between local learners and a central server affects performance in distributed online convex optimization with streaming data. It first proves lower bounds showing that any algorithm must incur regret Ω(δ^{-1/2}√T) for convex losses and Ω(δ^{-1} log T) for strongly convex losses, where δ ∈ (0,1] is the fixed compression ratio. It then gives an algorithm that matches these bounds in order by folding an error-feedback correction into the Follow-the-Regularized-Leader method and using an online compression rule to control accumulated bidirectional errors. The same construction yields the first convergence guarantees for the corresponding offline stochastic problem with domain constraints and non-smooth losses. A reader would care because it shows that substantial communication savings need not destroy asymptotic optimality in large-scale distributed learning.

Core claim

The paper establishes matching lower and upper bounds for distributed online convex optimization under compressed communication: lower bounds of Ω(δ^{-1/2}√T) and Ω(δ^{-1} log T), and an algorithm achieving O(δ^{-1/2}√T) and O(δ^{-1} log T) for convex and strongly convex losses, respectively. The algorithm embeds error feedback inside Follow-the-Regularized-Leader and applies online compression to the bidirectional links, then converts to stochastic rates of O(δ^{-1/2} T^{-1/2}) and O(δ^{-1} T^{-1}) via online-to-batch conversion.

What carries the argument

Error-feedback correction embedded inside the Follow-the-Regularized-Leader framework, paired with an online compression strategy that prevents accumulation of bidirectional quantization errors.
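
A minimal single-learner sketch of this mechanism, assuming top-k sparsification as the contractive compressor, a quadratic FTRL regularizer, and a Euclidean-ball domain. Every name and parameter here is illustrative; the paper's actual algorithm additionally coordinates multiple learners, compresses both communication directions, and schedules compression online.

```python
import numpy as np

def topk_compress(v, k):
    """Top-k sparsification: a standard contractive compressor whose
    fixed ratio is delta = k / len(v)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # indices of the k largest magnitudes
    out[idx] = v[idx]
    return out

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}, standing in
    for the feasible domain K."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def compressed_ftrl(grad_stream, dim, k, eta):
    """FTRL over compressed linearized losses with error feedback.

    Each round the fresh gradient plus the residual memory e is
    compressed; whatever the compressor discards is folded back into
    e, so compression error is deferred rather than lost.
    """
    theta = np.zeros(dim)  # running sum of compressed gradients
    e = np.zeros(dim)      # error-feedback memory
    for g in grad_stream:
        msg = topk_compress(e + g, k)  # what actually crosses the wire
        e = e + g - msg                # residual kept for the next round
        theta += msg
        # FTRL with regularizer ||x||^2 / (2 * eta), then projection onto K
        yield project_ball(-eta * theta)
```

Feeding e + g rather than g alone into the compressor is the error-feedback step the analysis leans on: the deferred residual is what makes the telescoping argument in the upper-bound proof possible.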

If this is right

  • The upper bounds are information-theoretically tight because they match the derived lower bounds for any fixed δ.
  • The same algorithm immediately supplies the first convergence rates for distributed non-smooth stochastic optimization with domain constraints and compressed communication.
  • The construction works for general compression operators that admit a fixed ratio and extends without change to strongly convex losses.
  • Online-to-batch conversion preserves the dependence on δ while turning regret into convergence rates (see the sketch after this list).
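
A one-line sketch of the last bullet, stated for the averaged iterate under the standard conversion; the notation is illustrative.

```latex
% Standard online-to-batch conversion, stated for the averaged iterate.
% For convex f and \bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t, Jensen's
% inequality turns the regret bound directly into a convergence rate:
\mathbb{E}\big[f(\bar{w}_T)\big] - \min_{w \in \mathcal{K}} f(w)
  \;\le\; \frac{\mathbb{E}[\mathrm{Regret}_T]}{T}
  \;=\; O\big(\delta^{-1/2}\, T^{-1/2}\big).
```

In the strongly convex case the same division would give O(δ^{-1} T^{-1} log T); reaching the paper's stated O(δ^{-1} T^{-1}) reportedly relies on the weighted anytime conversion of Cutkosky (2019) invoked in its Appendix B.7.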

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The error-feedback technique inside FTRL could be reused in other online settings that face quantization noise or delayed updates.
  • The result suggests that bandwidth-limited edge devices can still learn at the optimal rate asymptotically, provided the compression ratio stays constant.
  • Adaptive or data-dependent choices of δ might yield better practical constants while retaining the same worst-case scaling.
  • Similar correction mechanisms may help federated or decentralized variants of online convex optimization.

Load-bearing premise

The compression operator has a fixed ratio δ that does not depend on the iterates, and error feedback can be inserted into Follow-the-Regularized-Leader while keeping the standard online convex optimization assumptions intact.
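
A quick numeric check of this premise for one concrete operator, top-k sparsification, whose ratio δ = k/d is a property of the operator alone and never of the iterates. The operator class the paper actually assumes is stated abstractly (see the referee's first minor comment), so this is an illustration, not the paper's definition.

```python
import numpy as np

def topk_compress(v, k):
    """Top-k sparsification (the same illustrative operator as above)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

# The contraction ||C(x) - x||^2 <= (1 - delta) ||x||^2 with delta = k/d
# holds for every input x, because the k retained coordinates carry at
# least a k/d fraction of ||x||^2 -- the ratio depends on the operator,
# never on the iterates, exactly as the premise requires.
rng = np.random.default_rng(0)
d, k = 100, 10
delta = k / d
for _ in range(1000):
    x = rng.standard_normal(d)
    residual = np.linalg.norm(topk_compress(x, k) - x) ** 2
    assert residual <= (1 - delta) * np.linalg.norm(x) ** 2 + 1e-12
print(f"top-k satisfies the fixed-ratio contraction with delta = {delta}")
```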

What would settle it

A concrete compression operator with fixed ratio δ for which the proposed algorithm (or any algorithm) incurs regret strictly exceeding the stated O(·) bounds on some sequence of convex losses would refute the claimed optimality.

original abstract

Distributed online convex optimization (D-OCO) is a powerful paradigm for modeling distributed scenarios with streaming data. However, the communication cost between local learners and the central server is substantial in large-scale applications. To alleviate this bottleneck, we initiate the study of D-OCO with compressed communication. Firstly, to quantify the compression impact, we establish the $\Omega(\delta^{-1/2}\sqrt{T})$ and $\Omega(\delta^{-1}\log{T})$ lower bounds for convex and strongly convex loss functions, respectively, where $\delta \in (0,1]$ is the compression ratio. Secondly, we propose an optimal algorithm, which enjoys regret bounds of $O(\delta^{-1/2}\sqrt{T})$ and $O(\delta^{-1} \log T)$ for convex and strongly convex loss functions, respectively. Our method incorporates the error feedback mechanism into the Follow-the-Regularized-Leader framework to address the coupling between the compression error and the projection error. Furthermore, we employ the online compression strategy to mitigate the accumulated error arising from the bidirectional compression. Our online method has great generality, and can be extended to the offline stochastic setting via online-to-batch conversion. We establish convergence rates of $O(\delta^{-1/2}T^{-1/2})$ and $O(\delta^{-1} T^{-1})$ for convex and strongly convex loss functions, respectively, providing the first guarantees for distributed non-smooth optimization with compressed communication and domain constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript initiates the study of distributed online convex optimization (D-OCO) with compressed communication. It establishes information-theoretic lower bounds of Ω(δ^{-1/2} √T) for convex losses and Ω(δ^{-1} log T) for strongly convex losses (δ ∈ (0,1] the compression ratio). It then constructs an algorithm that achieves matching upper bounds O(δ^{-1/2} √T) and O(δ^{-1} log T) by incorporating error feedback into the Follow-the-Regularized-Leader (FTRL) framework and using an online compression strategy for bidirectional communication. The method extends via online-to-batch conversion to yield convergence rates O(δ^{-1/2} T^{-1/2}) and O(δ^{-1} T^{-1}) in the offline stochastic setting with domain constraints.

Significance. If the regret bounds and their proofs hold, the work provides the first tight characterization of the communication-compression tradeoff in D-OCO, including the first guarantees for non-smooth distributed optimization under compression and constraints. The explicit matching of lower and upper bounds, the handling of bidirectional compression via online strategies, and the online-to-batch extension are concrete strengths that advance both theory and practice in communication-efficient distributed learning.

major comments (2)
  1. [§4.2] §4.2 (FTRL update with error feedback): The analysis incorporates the feedback term e_t into the linear loss seen by FTRL and claims the standard OCO regret decomposition holds up to δ-scaled terms. However, the inner product between the compression error and the projection residual (x_{t+1} - y_{t+1}) after the projection onto K is not shown to be controlled by O(δ) (or absorbed without rate degradation). If this cross term is only O(√δ) or larger, the claimed O(δ^{-1/2} √T) regret fails to match the lower bound. The appendix telescoping argument assumes a form of orthogonality that requires explicit verification.
  2. [Proof of upper bound (Appendix)] Proof of upper bound (likely Theorem 4.1 or Appendix A): The error-feedback mechanism is stated to address the coupling between compression error and projection error, but the derivation of the per-step regret term after projection must be expanded to bound all cross terms explicitly. Without this, the optimality claim relative to the lower bound in §3 remains conditional on an unverified step.
minor comments (3)
  1. [§2] The definition of the compression operator and the fixed ratio δ should be stated with the precise assumption (e.g., unbiasedness or variance bound) in the main text rather than deferred entirely to the appendix, to make the lower-bound construction self-contained.
  2. The online compression strategy (as distinct from standard per-round compression) is introduced in the abstract, but its difference from prior error-feedback methods is not highlighted in a dedicated comparison paragraph or table.
  3. [§5] The online-to-batch conversion rates are stated without an explicit statement of the batch size or number of rounds used in the conversion; adding this would clarify the dependence on T.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The positive assessment of the significance of our tight bounds for D-OCO under compressed communication is appreciated. We address the two major comments on the upper-bound analysis below. We agree that additional explicit verification of the cross terms is warranted and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§4.2] §4.2 (FTRL update with error feedback): The analysis incorporates the feedback term e_t into the linear loss seen by FTRL and claims the standard OCO regret decomposition holds up to δ-scaled terms. However, the inner product between the compression error and the projection residual (x_{t+1} - y_{t+1}) after the projection onto K is not shown to be controlled by O(δ) (or absorbed without rate degradation). If this cross term is only O(√δ) or larger, the claimed O(δ^{-1/2} √T) regret fails to match the lower bound. The appendix telescoping argument assumes a form of orthogonality that requires explicit verification.

    Authors: We thank the referee for highlighting this point. The error-feedback mechanism feeds the compression error e_t back into the subsequent FTRL linear loss, so that the effective update accounts for prior compression. The per-step regret decomposition therefore includes the standard FTRL terms plus an additional inner product involving e_t and the projection residual (x_{t+1} - y_{t+1}). Because the compressor satisfies the δ-compression property and the feedback accumulates the exact residual error, the sum of these inner products telescopes. Using the non-expansiveness of the Euclidean projection and the fact that the residual is controlled by the gradient norm, the cross term is bounded by O(δ) times a quantity whose sum yields an additive O(δ^{-1/2} √T) contribution after applying Cauchy-Schwarz. No orthogonality is assumed; the bound follows directly from the compression definition and the projection inequality. We will insert a dedicated lemma in the appendix that carries out this calculation explicitly. revision: yes

  2. Referee: [Proof of upper bound (Appendix)] Proof of upper bound (likely Theorem 4.1 or Appendix A): The error-feedback mechanism is stated to address the coupling between compression error and projection error, but the derivation of the per-step regret term after projection must be expanded to bound all cross terms explicitly. Without this, the optimality claim relative to the lower bound in §3 remains conditional on an unverified step.

    Authors: We agree that the current write-up would benefit from a fully expanded derivation. In the revised appendix we will present the complete per-step regret expression after the projection step, isolating every cross term that involves the bidirectional compression errors. Each such term is controlled by the δ-compression property together with the online compression strategy that prevents error accumulation across rounds. The resulting bounds remain O(δ) per step and, when summed, preserve the O(δ^{-1/2} √T) and O(δ^{-1} log T) rates for convex and strongly convex losses, respectively, thereby matching the information-theoretic lower bounds of §3. The online-to-batch extension follows identically. revision: yes
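
A schematic of the recursion both responses lean on, written for a generic contractive compressor C and (near-)constant step sizes α_t; the notation is illustrative, and the paper's own lemma carries the exact constants.

```latex
% Error-feedback memory update and the contraction it inherits from C:
e_{t+1} = e_t + \alpha_t g_t - C(e_t + \alpha_t g_t)
\quad\Longrightarrow\quad
\mathbb{E}\|e_{t+1}\|^2 \;\le\; (1-\delta)\,\mathbb{E}\|e_t + \alpha_t g_t\|^2 .
% Unrolling with Young's inequality keeps \mathbb{E}\|e_t\|^2 bounded by
% O(\alpha^2 G^2 / \delta^2) uniformly in t, so the cross terms
% \langle e_t,\, x_{t+1} - y_{t+1} \rangle neither grow with t nor
% accumulate across rounds; summed via Cauchy--Schwarz they contribute
% the additive O(\delta^{-1/2}\sqrt{T}) term the rebuttal describes.
```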

Circularity Check

0 steps flagged

No circularity: lower bounds information-theoretic, upper bounds from explicit FTRL construction with error feedback.

full rationale

The paper separately derives information-theoretic lower bounds Ω(δ^{-1/2}√T) and Ω(δ^{-1} log T) and constructs an algorithm achieving matching O upper bounds via incorporation of error feedback into FTRL plus online compression. No step reduces a claimed regret to a fitted quantity, self-citation chain, or definitional tautology; the central regret decomposition is built from standard OCO analysis plus controlled compression terms rather than presupposing the target bound. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard convexity assumptions from the online convex optimization literature; no new free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Loss functions are convex (or μ-strongly convex)
    Standard assumption required for the regret definitions and bounds in OCO.
  • domain assumption Existence of a compression operator with fixed ratio δ ∈ (0,1]
    Defines the communication model used throughout the lower and upper bound statements.
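
For concreteness, the two standard formalizations of a fixed-ratio compressor from the literature the paper surveys (its appendix reportedly distinguishes unbiased from contractive compressors); which variant each theorem assumes is deferred to the paper's appendix.

```latex
% Unbiased compressor with ratio \delta \in (0,1]:
\mathbb{E}[C(x)] = x, \qquad \mathbb{E}\|C(x)\|^2 \le \tfrac{1}{\delta}\,\|x\|^2 ;
% Contractive compressor with ratio \delta \in (0,1]:
\mathbb{E}\|C(x) - x\|^2 \le (1-\delta)\,\|x\|^2 .
```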

pith-pipeline@v0.9.0 · 5568 in / 1198 out tokens · 74778 ms · 2026-05-10T17:37:59.868620+00:00 · methodology


Reference graph

Works this paper leans on

7 extracted references · 6 canonical work pages

  1. [1] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057.

  2. [2] Mingqing Chen, Rajiv Mathews, Tom Ouyang, and Francoise Beaufays. Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635.

  3. [3] Ahmet M. Elbir, Burak Soner, and Sinem Coleri. Federated learning in vehicular networks. arXiv preprint arXiv:2006.01412.

  4. [4] Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Francoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604.

  5. [5] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213.

  6. [6] Sebastian U. Stich. On communication compression for distributed optimization on heterogeneous data. arXiv preprint arXiv:2009.02388.

  7. [7] Sifan Yang, Wenhao Yang, Wei Jiang, and Lijun Zhang. Distributed online convex optimization with efficient communication: Improved algorithm and lower bounds. arXiv preprint arXiv:2601.04907.
