Communication-Efficient Approximate Gradient Coding for Distributed Learning in Heterogeneous Systems

Heekang Song; Wan Choi

arxiv: 2605.15890 · v1 · pith:JXQPBWWDnew · submitted 2026-05-15 · 📡 eess.SY · cs.SY

Communication-Efficient Approximate Gradient Coding for Distributed Learning in Heterogeneous Systems

Heekang Song , Wan Choi This is my paper

Pith reviewed 2026-05-20 16:33 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords gradient codingdistributed learningstraggler resiliencequantizationheterogeneous systemscommunication efficiencybit allocationconvergence analysis

0 comments

The pith

A joint optimization of gradient coding and quantization yields a closed-form structure that minimizes residual error in heterogeneous distributed learning under an unbiasedness constraint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework for gradient coding in distributed learning that accounts for heterogeneous worker speeds and communication limits. By formulating an optimization to reduce residual error while keeping the estimator unbiased, it derives an exact optimal code structure paired with the best bit allocation. This approach includes a practical low-complexity algorithm for near-optimal results and proves convergence for convex smooth loss functions. Experiments confirm quicker convergence and lower communication costs on image datasets.

Core claim

We rigorously establish the joint global optimum by deriving a closed-form code structure coupled with an optimal bit allocation strategy, while simultaneously proposing a low-complexity bit allocation algorithm that efficiently yields near-optimal performance. We provide rigorous convergence analysis for convex and smooth functions.

What carries the argument

The unified framework optimizing both gradient coding and quantization to minimize residual error subject to an unbiasedness constraint.

If this is right

Convergence guarantees hold for convex and smooth objective functions in distributed settings.
The low-complexity bit allocation algorithm achieves performance close to the optimal strategy.
Empirical tests on the COCO dataset show accelerated convergence and improved communication efficiency over baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such structures might extend to dynamic environments where worker heterogeneity changes over time.
The approach could lower overall energy use in large-scale training clusters by reducing bits transmitted.
Similar joint designs may benefit other coding schemes beyond gradients, like in federated learning variants.

Load-bearing premise

The optimization problem to minimize residual error subject to an unbiasedness constraint admits a closed-form joint global optimum for heterogeneous systems.

What would settle it

Running the proposed scheme on a heterogeneous cluster and observing that the residual error does not match the predicted minimum or that convergence does not accelerate would falsify the joint optimality claim.

Figures

Figures reproduced from arXiv: 2605.15890 by Heekang Song, Wan Choi.

**Figure 1.** Figure 1: The objective value F(r) for (b-SP1) with respect to the bit budget Zres: (a) k = 10 and (b) k = 50. COCO dataset, with a learning rate of γt = 0.05. Let τth represent the response time deadline for a training iteration. A worker node i ∈ [1 : k] is identified as a straggler if its total delay τi , comprising local gradient computation and communication time, exceeds this threshold (i.e., τi > τth). Conseq… view at source ↗

**Figure 2.** Figure 2: Convergence graph with respect to the training itera [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Convergence graph with respect to the training itera [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Detected objects of sampled image: (a) SGD (b) [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

We propose a communication-efficient optimally structured gradient coding scheme to jointly address straggler resilience and communication efficiency in heterogeneous distributed learning. By establishing a unified framework that simultaneously optimizes gradient coding and quantization, we formulate an optimization problem to minimize residual error subject to an unbiasedness constraint. We rigorously establish the joint global optimum by deriving a closed-form code structure coupled with an optimal bit allocation strategy, while simultaneously proposing a low-complexity bit allocation algorithm that efficiently yields near-optimal performance. We provide rigorous convergence analysis for convex and smooth functions. Experiments on the COCO dataset demonstrate that our joint design significantly accelerates convergence and enhances communication efficiency compared to existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a joint closed-form design for gradient coding plus quantization that handles stragglers and communication in heterogeneous setups.

read the letter

The main takeaway is that the authors built a single optimization framework that picks both the coding structure and the bit allocation at once for heterogeneous workers. They minimize residual error under an unbiasedness constraint and report a closed-form solution for the codes along with an optimal bit strategy. A low-complexity algorithm is added so the bit part stays usable in practice. Convergence is proved for convex smooth losses, and COCO experiments show faster convergence and lower communication than the usual separate baselines.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a communication-efficient approximate gradient coding scheme for distributed learning in heterogeneous systems. It establishes a unified framework that jointly optimizes gradient coding and quantization by formulating an optimization problem to minimize residual error subject to an unbiasedness constraint. The authors claim to rigorously establish the joint global optimum via a closed-form code structure and optimal bit allocation strategy, propose a low-complexity bit allocation algorithm, provide convergence analysis for convex and smooth functions, and report improved performance on the COCO dataset.

Significance. If the closed-form derivations and convergence guarantees hold, the work would advance distributed optimization by delivering a theoretically grounded joint solution to straggler resilience and communication efficiency under heterogeneity. The combination of an optimal structure, a practical algorithm, and experimental validation on COCO constitutes a concrete contribution that could inform system design in large-scale learning.

major comments (1)

[Section 3] Section 3 (optimization formulation): the claim that the problem of minimizing residual error subject to unbiasedness admits a closed-form joint global optimum over code structure and bit allocation requires the explicit Lagrangian, stationarity conditions, and verification that the solution is global rather than a critical point; without these steps it is unclear whether heterogeneity introduces non-convexity that precludes the asserted closed form.

minor comments (2)

[Convergence analysis] The convergence theorem statement would be strengthened by explicitly listing the smoothness and strong-convexity constants used in the rate.
[Experiments] Figure captions for the COCO experiments should include the number of independent runs and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review, positive assessment of the contribution, and recommendation for minor revision. We address the major comment on the optimization formulation below.

read point-by-point responses

Referee: [Section 3] Section 3 (optimization formulation): the claim that the problem of minimizing residual error subject to unbiasedness admits a closed-form joint global optimum over code structure and bit allocation requires the explicit Lagrangian, stationarity conditions, and verification that the solution is global rather than a critical point; without these steps it is unclear whether heterogeneity introduces non-convexity that precludes the asserted closed form.

Authors: We appreciate the referee's request for greater transparency in the optimality proof. In the manuscript, the joint problem is solved by first deriving a closed-form gradient coding structure that minimizes the residual error expression for arbitrary bit allocations while enforcing unbiasedness; this structure turns out to be independent of the per-worker heterogeneity parameters. The remaining bit-allocation subproblem is then formulated as a convex optimization task whose objective is quadratic in the quantization noise variances. Because the subproblem is strictly convex, any critical point obtained from the KKT conditions is the unique global minimizer. In the revised version we will explicitly display the Lagrangian of the bit-allocation problem, the full set of stationarity conditions, and the second-order verification (positive-definite Hessian) that confirms globality. This separation also shows that heterogeneity does not induce non-convexity in the overall formulation, as it is confined to the convex bit-allocation stage. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified in the derivation

full rationale

The paper formulates an optimization problem minimizing residual error subject to an unbiasedness constraint, then claims a closed-form joint global optimum for code structure and bit allocation in heterogeneous systems. The provided abstract and summary contain no equations, Lagrangians, or derivation steps that reduce the claimed optimum to a fitted parameter, self-referential definition, or input by construction. Convergence analysis for convex smooth functions and COCO experiments are presented as separate validations. No load-bearing self-citation chain or ansatz smuggling is evident from the text. The derivation therefore remains self-contained within the stated unified framework rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full text unavailable so ledger entries are limited to elements explicitly named in the abstract.

axioms (2)

domain assumption The distributed system is heterogeneous and subject to stragglers.
Stated in title and abstract as the setting for the scheme.
domain assumption An unbiasedness constraint is imposed on the gradient estimates.
Explicitly used to formulate the optimization problem.

pith-pipeline@v0.9.0 · 5628 in / 1349 out tokens · 76239 ms · 2026-05-20T16:33:59.946715+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

formulate an optimization problem to minimize residual error subject to an unbiasedness constraint... derive a closed-form code structure coupled with an optimal bit allocation strategy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 3 internal anchors

[1]

MapReduce: Simpliﬁed data proc essing on large clusters,

J. Dean and S. Ghemawat, “MapReduce: Simpliﬁed data proc essing on large clusters,” Commun. ACM , vol. 51, no. 1, pp. 107–113, 2008

work page 2008
[2]

Spark: Cluster computing with working sets,

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd USENIX Conf. Hot Top. Cloud Comput. , Boston, MA, USA, Jun. 2010, pp. 107–113

work page 2010
[3]

Coded matrix computation in wireless network,

K. Son and W. Choi, “Coded matrix computation in wireless network,” IEEE Trans. Wireless Commun. , vol. 23, no. 6, pp. 6394–6410, Jun. 2024

work page 2024
[4]

Joint design of shufﬂing and func- tion assignment for heterogeneous coded distributed compu ting,

H. Song, K. Son, and W. Choi, “Joint design of shufﬂing and func- tion assignment for heterogeneous coded distributed compu ting,” IEEE Trans. Signal Process. , vol. 70, pp. 2560–2575, May 2022

work page 2022
[5]

Gradient coding: Avoiding stragglers in distributed learning,

R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “ Gradient coding: Avoiding stragglers in distributed learning,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2017, pp. 3368–3376

work page 2017
[6]

Communication–computation efﬁcient gradient coding,

M. Y e and E. Abbe, “Communication–computation efﬁcient gradient coding,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2018, p. 9716

work page 2018
[7]

Gradient coding from cyclic MDS codes and expander graphs,

N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic MDS codes and expander graphs,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2018, pp. 4302–4310

work page 2018
[8]

Appro ximate gradient coding via sparse random graphs,

Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Appro ximate gradient coding via sparse random graphs,” arXiv preprint arXiv:171 1.06771, 2017

work page 2017
[9]

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

H. Wang, Z. Charles, and D. Papailiopoulos, “ErasureHea d: Distributed gradient descent without delays using approximate gradien t coding,” arXiv preprint arXiv:1901.09671, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[10]

Fundamental Limits of Approximate Gradient Coding

S. Wang, J. Liu, and N. Shroff, “Fundamental limits of ap proximate gradient coding,” arXiv preprint arXiv:1901.08166, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[11]

Approximate gradient codi ng with optimal decoding,

M. Glasgow and M. Wootters, “Approximate gradient codi ng with optimal decoding,” IEEE J. Sel. Areas Inf. Theory , vol. 2, no. 3, pp. 855–866, Sept. 2021

work page 2021
[12]

Stochastic g radient coding for straggler mitigation in distributed learning,

R. Bitar, M. Wootters, and S. El Rouayheb, “Stochastic g radient coding for straggler mitigation in distributed learning,” IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 277–291, May 2020

work page 2020
[13]

Distributed learning based on 1- bit gradient coding in the presence of stragglers,

C. Li and M. Skoglund, “Distributed learning based on 1- bit gradient coding in the presence of stragglers,” IEEE Trans. Commun. , vol. 72, no. 8, pp. 4903–4916, Aug. 2024

work page 2024
[14]

Approximate gradient coding for di stributed learning with heterogeneous stragglers,

H. Song and W. Choi, “Approximate gradient coding for di stributed learning with heterogeneous stragglers,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2025

work page 2025
[15]

Regulated subspace projection bas ed local model update compression for communication-efﬁcient federated learning,

S. Park and W. Choi, “Regulated subspace projection bas ed local model update compression for communication-efﬁcient federated learning,” IEEE J. Sel. Areas Commun. , vol. 41, no. 4, pp. 964–976, Apr. 2023

work page 2023
[16]

CVX: Matlab software for discipli ned convex programming, v2.1,

M. Grant and S. Boyd, “CVX: Matlab software for discipli ned convex programming, v2.1,” Mar. 2014. [Online]. Available: http: //cvxr.com/cvx

work page 2014
[17]

Y ALMIP: A toolbox for modeling and optimi zation in MA TLAB,

J. L ¨ ofberg, “Y ALMIP: A toolbox for modeling and optimi zation in MA TLAB,” in Proc. IEEE Int. Conf. Robot. Autom. , May 2004, pp. 284–289

work page 2004
[18]

Timely distributed comput ation with stragglers,

B. Buyukates and S. Ulukus, “Timely distributed comput ation with stragglers,” IEEE Trans. Commun. , vol. 68, no. 9, pp. 5273–5282, Sept. 2020

work page 2020
[19]

Speeding up distributed machine learning using codes,

K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. R amchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. Inf. Theory , vol. 64, no. 3, pp. 1514–1529, Mar. 2018

work page 2018
[20]

Coded computation over heterogeneous clusters,

A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avest imehr, “Coded computation over heterogeneous clusters,” IEEE Trans. Inf. Theory , vol. 65, no. 7, pp. 4227–4242, July 2019

work page 2019
[21]

QSGD: Communication-efﬁcient SGD via gradient quantization and encoding,

D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. V ojnovi c, “QSGD: Communication-efﬁcient SGD via gradient quantization and encoding,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2017

work page 2017
[22]

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradie nt descent optimal for strongly convex stochastic optimization,” arX iv preprint arXiv:1109.5647, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011
[23]

Microsoft COCO: Common objects in context,

T.-Y . Lin et al. , “Microsoft COCO: Common objects in context,” in Proc. Eur . Conf. Comput. Vis. , 2014, pp. 740–755

work page 2014
[24]

Stochastic gradient descent fo r non-smooth optimization: Convergence results and optimal averaging s chemes,

O. Shamir and T. Zhang, “Stochastic gradient descent fo r non-smooth optimization: Convergence results and optimal averaging s chemes,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2013, pp. 71–79

work page 2013
[25]

G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, ”An analy sis of approximations for maximizing submodular set functions -I,” Math. Programm., vol. 14, no. 1, pp. 265-294, 1978

work page 1978
[26]

Optimal approximation for the submodular welfare problem in the value oracle model,

J. V ondr´ ak, “Optimal approximation for the submodular welfare problem in the value oracle model,” ACM STOC, pp. 67–74, 2008. 15

work page 2008
[27]

R. K. Iyer, S. Jegelka, and J. A. Bilmes, ”Curvature and o ptimal algorithms for learning and minimizing submodular functio ns,” Adv. Neural Inf. Process. Syst. , pp. 2742–2750, 2013

work page 2013
[28]

Optimal appro ximation for submodular and supermodular optimization with bounded cur vature,

M. Sviridenko, J. V ondr´ ak, and J. Ward, “Optimal appro ximation for submodular and supermodular optimization with bounded cur vature,” Math. of Oper . Res. , vol. 42, no. 4, pp. 1197–1218, 2017

work page 2017
[29]

Performance guarantees of forward and reverse greedy algorithms for minimizing nonsu permodular nonsubmodular functions on a matroid,

O. Karaca, D. Tihanyi, and M. Kamgarpour, “Performance guarantees of forward and reverse greedy algorithms for minimizing nonsu permodular nonsubmodular functions on a matroid,” arXiv, 2021

work page 2021

[1] [1]

MapReduce: Simpliﬁed data proc essing on large clusters,

J. Dean and S. Ghemawat, “MapReduce: Simpliﬁed data proc essing on large clusters,” Commun. ACM , vol. 51, no. 1, pp. 107–113, 2008

work page 2008

[2] [2]

Spark: Cluster computing with working sets,

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd USENIX Conf. Hot Top. Cloud Comput. , Boston, MA, USA, Jun. 2010, pp. 107–113

work page 2010

[3] [3]

Coded matrix computation in wireless network,

K. Son and W. Choi, “Coded matrix computation in wireless network,” IEEE Trans. Wireless Commun. , vol. 23, no. 6, pp. 6394–6410, Jun. 2024

work page 2024

[4] [4]

Joint design of shufﬂing and func- tion assignment for heterogeneous coded distributed compu ting,

H. Song, K. Son, and W. Choi, “Joint design of shufﬂing and func- tion assignment for heterogeneous coded distributed compu ting,” IEEE Trans. Signal Process. , vol. 70, pp. 2560–2575, May 2022

work page 2022

[5] [5]

Gradient coding: Avoiding stragglers in distributed learning,

R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “ Gradient coding: Avoiding stragglers in distributed learning,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2017, pp. 3368–3376

work page 2017

[6] [6]

Communication–computation efﬁcient gradient coding,

M. Y e and E. Abbe, “Communication–computation efﬁcient gradient coding,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2018, p. 9716

work page 2018

[7] [7]

Gradient coding from cyclic MDS codes and expander graphs,

N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, “Gradient coding from cyclic MDS codes and expander graphs,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2018, pp. 4302–4310

work page 2018

[8] [8]

Appro ximate gradient coding via sparse random graphs,

Z. Charles, D. Papailiopoulos, and J. Ellenberg, “Appro ximate gradient coding via sparse random graphs,” arXiv preprint arXiv:171 1.06771, 2017

work page 2017

[9] [9]

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

H. Wang, Z. Charles, and D. Papailiopoulos, “ErasureHea d: Distributed gradient descent without delays using approximate gradien t coding,” arXiv preprint arXiv:1901.09671, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[10] [10]

Fundamental Limits of Approximate Gradient Coding

S. Wang, J. Liu, and N. Shroff, “Fundamental limits of ap proximate gradient coding,” arXiv preprint arXiv:1901.08166, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[11] [11]

Approximate gradient codi ng with optimal decoding,

M. Glasgow and M. Wootters, “Approximate gradient codi ng with optimal decoding,” IEEE J. Sel. Areas Inf. Theory , vol. 2, no. 3, pp. 855–866, Sept. 2021

work page 2021

[12] [12]

Stochastic g radient coding for straggler mitigation in distributed learning,

R. Bitar, M. Wootters, and S. El Rouayheb, “Stochastic g radient coding for straggler mitigation in distributed learning,” IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 277–291, May 2020

work page 2020

[13] [13]

Distributed learning based on 1- bit gradient coding in the presence of stragglers,

C. Li and M. Skoglund, “Distributed learning based on 1- bit gradient coding in the presence of stragglers,” IEEE Trans. Commun. , vol. 72, no. 8, pp. 4903–4916, Aug. 2024

work page 2024

[14] [14]

Approximate gradient coding for di stributed learning with heterogeneous stragglers,

H. Song and W. Choi, “Approximate gradient coding for di stributed learning with heterogeneous stragglers,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2025

work page 2025

[15] [15]

Regulated subspace projection bas ed local model update compression for communication-efﬁcient federated learning,

S. Park and W. Choi, “Regulated subspace projection bas ed local model update compression for communication-efﬁcient federated learning,” IEEE J. Sel. Areas Commun. , vol. 41, no. 4, pp. 964–976, Apr. 2023

work page 2023

[16] [16]

CVX: Matlab software for discipli ned convex programming, v2.1,

M. Grant and S. Boyd, “CVX: Matlab software for discipli ned convex programming, v2.1,” Mar. 2014. [Online]. Available: http: //cvxr.com/cvx

work page 2014

[17] [17]

Y ALMIP: A toolbox for modeling and optimi zation in MA TLAB,

J. L ¨ ofberg, “Y ALMIP: A toolbox for modeling and optimi zation in MA TLAB,” in Proc. IEEE Int. Conf. Robot. Autom. , May 2004, pp. 284–289

work page 2004

[18] [18]

Timely distributed comput ation with stragglers,

B. Buyukates and S. Ulukus, “Timely distributed comput ation with stragglers,” IEEE Trans. Commun. , vol. 68, no. 9, pp. 5273–5282, Sept. 2020

work page 2020

[19] [19]

Speeding up distributed machine learning using codes,

K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. R amchandran, “Speeding up distributed machine learning using codes,” IEEE Trans. Inf. Theory , vol. 64, no. 3, pp. 1514–1529, Mar. 2018

work page 2018

[20] [20]

Coded computation over heterogeneous clusters,

A. Reisizadeh, S. Prakash, R. Pedarsani, and A. S. Avest imehr, “Coded computation over heterogeneous clusters,” IEEE Trans. Inf. Theory , vol. 65, no. 7, pp. 4227–4242, July 2019

work page 2019

[21] [21]

QSGD: Communication-efﬁcient SGD via gradient quantization and encoding,

D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. V ojnovi c, “QSGD: Communication-efﬁcient SGD via gradient quantization and encoding,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2017

work page 2017

[22] [22]

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradie nt descent optimal for strongly convex stochastic optimization,” arX iv preprint arXiv:1109.5647, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011

[23] [23]

Microsoft COCO: Common objects in context,

T.-Y . Lin et al. , “Microsoft COCO: Common objects in context,” in Proc. Eur . Conf. Comput. Vis. , 2014, pp. 740–755

work page 2014

[24] [24]

Stochastic gradient descent fo r non-smooth optimization: Convergence results and optimal averaging s chemes,

O. Shamir and T. Zhang, “Stochastic gradient descent fo r non-smooth optimization: Convergence results and optimal averaging s chemes,” in Proc. Int. Conf. Mach. Learn. (ICML) , 2013, pp. 71–79

work page 2013

[25] [25]

G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, ”An analy sis of approximations for maximizing submodular set functions -I,” Math. Programm., vol. 14, no. 1, pp. 265-294, 1978

work page 1978

[26] [26]

Optimal approximation for the submodular welfare problem in the value oracle model,

J. V ondr´ ak, “Optimal approximation for the submodular welfare problem in the value oracle model,” ACM STOC, pp. 67–74, 2008. 15

work page 2008

[27] [27]

R. K. Iyer, S. Jegelka, and J. A. Bilmes, ”Curvature and o ptimal algorithms for learning and minimizing submodular functio ns,” Adv. Neural Inf. Process. Syst. , pp. 2742–2750, 2013

work page 2013

[28] [28]

Optimal appro ximation for submodular and supermodular optimization with bounded cur vature,

M. Sviridenko, J. V ondr´ ak, and J. Ward, “Optimal appro ximation for submodular and supermodular optimization with bounded cur vature,” Math. of Oper . Res. , vol. 42, no. 4, pp. 1197–1218, 2017

work page 2017

[29] [29]

Performance guarantees of forward and reverse greedy algorithms for minimizing nonsu permodular nonsubmodular functions on a matroid,

O. Karaca, D. Tihanyi, and M. Kamgarpour, “Performance guarantees of forward and reverse greedy algorithms for minimizing nonsu permodular nonsubmodular functions on a matroid,” arXiv, 2021

work page 2021