pith. sign in

arxiv: 2501.06793 · v6 · submitted 2025-01-12 · 📡 eess.SY · cs.SY

Differentially Private Gradient-Tracking-Based Distributed Stochastic Optimization over Directed Graphs

Pith reviewed 2026-05-23 05:11 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords differentially private optimizationgradient trackingdistributed stochastic optimizationdirected graphsnonconvex objectivesPolyak-Lojasiewicz conditionsubsamplingconvergence rates
0
0 comments X

The pith

A gradient-tracking algorithm adds noise for differential privacy in distributed stochastic optimization over directed graphs while still converging almost surely and in mean square for nonconvex objectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a distributed optimization method over directed graphs in which each agent adds privacy noise to its state and tracking variable before sending them to neighbors. Two schemes adjust step sizes and sampling counts, using a subsampling technique that raises the privacy level and keeps the total privacy budget finite even after unlimited iterations. The method proves both almost-sure and mean-square convergence for nonconvex problems, and supplies polynomial or exponential mean-square rates under an extra structural condition on the objective. It also displays the resulting privacy-convergence trade-off and tests the approach on distributed training of neural networks.

Core claim

By incorporating privacy noises into each agent's state and tracking variable and employing two novel schemes for step-sizes and sampling numbers with sampling parameter-controlled subsampling, the algorithm ensures both almost sure and mean square convergence for nonconvex objectives over directed graphs, while keeping the cumulative privacy budget finite over infinite iterations; under the Polyak-Lojasiewicz condition, it achieves polynomial or exponential mean square convergence rates depending on the scheme.

What carries the argument

Gradient-tracking mechanism with added privacy noise on transmitted states and tracking variables, stabilized by two step-size and sampling schemes that incorporate controlled subsampling.

If this is right

  • The algorithm converges almost surely and in mean square for nonconvex objectives.
  • Scheme (S1) achieves a polynomial mean square convergence rate while Scheme (S2) achieves an exponential rate when the objective satisfies the Polyak-Lojasiewicz condition.
  • The sampling parameter-controlled subsampling method keeps the cumulative privacy budget finite even over infinite iterations.
  • A concrete trade-off exists between the level of privacy and the speed of convergence that can be tuned through the choice of scheme and parameters.
  • Numerical tests on distributed training with MNIST and CIFAR-10 show better performance than prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The finite cumulative privacy budget could make the method suitable for long-running or streaming optimization tasks on networks.
  • The noise-injection approach might be adapted to other distributed first-order methods that rely on tracking or consensus steps.
  • Performance on graphs with time-varying or switching directed edges remains an open question that could be tested directly.
  • The same subsampling idea could be combined with other privacy mechanisms such as clipping or quantization in future variants.

Load-bearing premise

The directed graph keeps the gradient-tracking process stable after privacy noise is added to every transmitted state and tracking variable, and the two schemes can be chosen so that convergence and a finite cumulative privacy loss hold at the same time.

What would settle it

Numerical runs of either scheme on the MNIST or CIFAR-10 distributed training tasks that show either failure of almost-sure or mean-square convergence or unbounded growth of the cumulative privacy budget would disprove the claims.

read the original abstract

This paper proposes a differentially private gradient-tracking-based distributed stochastic optimization algorithm over directed graphs. In particular, privacy noises are incorporated into each agent's state and tracking variable to mitigate information leakage, after which the perturbed states and tracking variables are transmitted to neighbors. We design two novel schemes for the step-sizes and the sampling number within the algorithm. The sampling parameter-controlled subsampling method employed by both schemes enhances the differential privacy level, and ensures a finite cumulative privacy budget even over infinite iterations. The algorithm achieves both almost sure and mean square convergence for nonconvex objectives. Furthermore, when nonconvex objectives satisfy the Polyak-Lojasiewicz condition, Scheme (S1) achieves a polynomial mean square convergence rate, and Scheme (S2) achieves an exponential mean square convergence rate. The trade-off between privacy and convergence is presented. The effectiveness of the algorithm and its superior performance compared to existing works are illustrated through numerical examples of distributed training on the benchmark datasets "MNIST" and "CIFAR-10".

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a differentially private gradient-tracking algorithm for distributed stochastic optimization over directed graphs. Privacy noise is added to each agent's state and tracking variable before transmission. Two schemes (S1, S2) are introduced for step-size and sampling-parameter sequences that employ subsampling to guarantee a finite cumulative privacy budget over infinite iterations while preserving convergence. The paper claims almost-sure and mean-square convergence for general non-convex objectives; under the Polyak-Łojasiewicz condition, S1 yields a polynomial mean-square rate and S2 an exponential rate. Numerical results on MNIST and CIFAR-10 distributed training are provided to illustrate performance and privacy-convergence trade-offs.

Significance. If the convergence analysis correctly accounts for the persistent additive perturbations introduced by independent privacy noise on both the state and tracking variables, the finite cumulative privacy budget and the explicit rates under PL would constitute a useful contribution to private distributed optimization on directed networks. The numerical validation on standard ML benchmarks strengthens the practical relevance.

major comments (2)
  1. [§4.2] §4.2 (Tracking-error recursion): the derivation of the disagreement dynamics must explicitly bound the effect of the independent privacy noises added to both x_i and v_i at every transmission; standard row-stochastic cancellation no longer holds exactly, and it is unclear whether the extra term remains dominated by the step-size schedule used for a.s. convergence.
  2. [Theorem 4.3] Theorem 4.3 (PL case, Scheme S2): the exponential mean-square rate is stated to hold simultaneously with finite cumulative privacy loss; the proof must verify that the sampling-parameter growth required for the privacy budget does not violate the step-size conditions needed to absorb the noise-induced tracking error inside the PL Lyapunov function.
minor comments (3)
  1. [§3] Notation for the two schemes (S1, S2) should be introduced once in §3 and used consistently; currently the sampling-parameter sequence is defined differently in the privacy and convergence sections.
  2. [Figure 3] Figure 3 (privacy budget vs. iterations): the y-axis scale and the exact privacy parameter ε used in the plot should be stated explicitly in the caption.
  3. [§5] The statement that the algorithm is 'superior to existing works' in the numerical section should be supported by a direct comparison table that includes both convergence speed and achieved privacy budget for the baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important points in the convergence analysis that we will address explicitly in the revision.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Tracking-error recursion): the derivation of the disagreement dynamics must explicitly bound the effect of the independent privacy noises added to both x_i and v_i at every transmission; standard row-stochastic cancellation no longer holds exactly, and it is unclear whether the extra term remains dominated by the step-size schedule used for a.s. convergence.

    Authors: We agree that the independent privacy noises on both the state and tracking variables require explicit treatment. In the revised Section 4.2 we will expand the disagreement recursion to isolate the two additional noise terms, derive a uniform bound on their contribution using the row-stochastic property and the chosen step-size and sampling sequences, and verify that these terms remain summable under the conditions already stated for almost-sure convergence. revision: yes

  2. Referee: [Theorem 4.3] Theorem 4.3 (PL case, Scheme S2): the exponential mean-square rate is stated to hold simultaneously with finite cumulative privacy loss; the proof must verify that the sampling-parameter growth required for the privacy budget does not violate the step-size conditions needed to absorb the noise-induced tracking error inside the PL Lyapunov function.

    Authors: We will augment the proof of Theorem 4.3 with an explicit verification step showing that the growth rate of the sampling parameter in Scheme S2 is compatible with the step-size conditions required to dominate the noise-induced tracking error inside the PL Lyapunov function. The revised proof will list the concrete inequalities that the chosen sequences satisfy simultaneously. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation chain not reducible to inputs

full rationale

The provided abstract and description contain no equations, fitted parameters, or derivation steps. Convergence claims (a.s. and mean-square rates under nonconvex and PL conditions) are stated as outcomes of the two step-size/sampling schemes without any reduction to self-defined quantities, self-citations, or renamed empirical patterns. The privacy budget finiteness via subsampling is presented as a design property rather than a fitted prediction. No load-bearing step is shown to collapse by construction, so the analysis remains self-contained against external stochastic-optimization and privacy benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; full manuscript would be required to populate the ledger.

pith-pipeline@v0.9.0 · 5708 in / 1228 out tokens · 48104 ms · 2026-05-23T05:11:15.589304+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

  1. [1]

    Push-pull gradient methods for distributed optimization in networks,

    S. Pu, W. Shi, J. M. Xu, and A. Nedi ´c, “Push-pull gradient methods for distributed optimization in networks,” IEEE Trans. Autom. Control, vol. 66, no. 1, pp. 1–16, 2020

  2. [2]

    Nonasymptotic bounds for stochastic optimization with biased noisy gradient oracles,

    N. Bhavsar and L. A. Prashanth, “Nonasymptotic bounds for stochastic optimization with biased noisy gradient oracles,” IEEE Trans. Autom. Control, vol. 68, no. 3, pp. 1628–1641, 2023

  3. [3]

    Distributed stochastic optimization and learning,

    O. Shamir and N. Srebro, “Distributed stochastic optimization and learning,” in 52nd Annu. Allerton Conf. Commun. Control Comput. , Monticello, IL, USA, 2014, pp. 850–857

  4. [4]

    DGLB: distributed stochastic geographical load balancing over cloud networks,

    T. Y . Chen, A. G. Marques, and G. B. Giannakis, “DGLB: distributed stochastic geographical load balancing over cloud networks,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 7, pp. 1866–1880, 2016. 24

  5. [5]

    Design of multipath transmission control for information-centric Internet of Things: a distributed stochastic optimization framework,

    M. Wang, C. Q. Xu, X. Y . Chen, H. Hao, L. J. Zhong, and D. O. Wu, “Design of multipath transmission control for information-centric Internet of Things: a distributed stochastic optimization framework,” IEEE Internet Things J. , vol. 6, no. 6, pp. 9475–9488, 2019

  6. [6]

    Convergence rates of distributed gradient methods under random quantization: a stochastic approximation approach,

    T. T. Doan, S. T. Maguluri, and J. Romberg, “Convergence rates of distributed gradient methods under random quantization: a stochastic approximation approach,” IEEE Trans. Autom. Control, vol. 66, no. 10, pp. 4469–4484, 2021

  7. [7]

    Convergence in high probability of distributed stochastic gradient descent algorithms,

    K. H. Lu, H. X. Wang, H. S. Zhang, and L. Wang, “Convergence in high probability of distributed stochastic gradient descent algorithms,” IEEE Trans. Autom. Control , vol. 69, no. 4, pp. 2189–2204, 2024

  8. [8]

    Distributed convex optimization with inequality constraints over time-varying unbalanced digraphs,

    P. Xie, K. Y . You, R. Tempo, S. J. Song, and C. Wu,“Distributed convex optimization with inequality constraints over time-varying unbalanced digraphs,” IEEE Trans. Autom. Control, vol. 63, no. 12, pp. 4331–4337, 2018

  9. [9]

    Harnessing smoothness to accelerate distributed optimization,

    G. N. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245– 1260, 2018

  10. [10]

    Variance-reduced decentralized stochastic optimization with accelerated convergence,

    R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with accelerated convergence,” IEEE Trans. Signal Process., vol. 68, pp. 6255–6271, 2020

  11. [11]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedi ´c, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, no. 1, pp. 409–457, 2021

  12. [12]

    An improved analysis of gradient tracking for decentralized machine learning,

    A. Koloskova, T. Lin, and S. U. Stich, “An improved analysis of gradient tracking for decentralized machine learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 11422–11435

  13. [13]

    Distributed stochastic optimization with gradient tracking over strongly-connected networks,

    R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, “Distributed stochastic optimization with gradient tracking over strongly-connected networks,” in IEEE 58th Conf. Decis. Control , Nice, France, 2019, pp. 8353–8358

  14. [14]

    Distributed variable sample- size stochastic optimization with fixed step-sizes,

    J. L. Lei, P. Yi, J. Chen, and Y . G. Hong, “Distributed variable sample- size stochastic optimization with fixed step-sizes,” IEEE Trans Autom. Control, vol. 67, no. 10, pp. 5630–5637, 2022

  15. [15]

    Asymptotic properties of S-AB method with diminishing step-size,

    S. C. Zhao and Y . C. Liu, “Asymptotic properties of S-AB method with diminishing step-size,” IEEE Trans. Autom. Control , vol. 69, no. 5, pp. 3222-3229, 2024

  16. [16]

    Accelerated distributed stochas- tic non-convex optimization over time-varying directed networks,

    Y . Y . Chen, A. Hashemi, and H. Vikalo, “Accelerated distributed stochas- tic non-convex optimization over time-varying directed networks,” IEEE Trans. Autom. Control, vol. 70, no. 4, pp. 2196-2211, 2025

  17. [17]

    Gradient-tracking-based distributed optimiza- tion with guaranteed optimality under noisy information sharing,

    Y . Q. Wang and T. Bas ¸ar, “Gradient-tracking-based distributed optimiza- tion with guaranteed optimality under noisy information sharing,” IEEE Trans. Autom. Control, vol. 68, no. 8, pp. 4796–4811, 2023

  18. [18]

    Model inversion attacks that exploit confidence information and basic countermeasures

    M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacks that exploit confidence information and basic countermeasures”, in Conf. Comput. Commun. Secur., Denver, CO, USA, 2015, pp. 1322–1333

  19. [19]

    Deep leakage from gradients,

    L. G. Zhu, Z. J. Liu, and S. Han, “Deep leakage from gradients,” in Proc. Adv. Neural Inf. Process Syst. , Vancouver, Canada, 2019, vol. 32, pp. 14774–14784

  20. [20]

    Privacy security in control systems,

    J. F. Zhang, J. W. Tan, and J. M. Wang, “Privacy security in control systems,” Sci. China Inf. Sci. , vol. 64, no. 7, 2021, Art. no. 176201

  21. [21]

    Privacy preserving distributed optimization using homomorphic encryption,

    Y . Lu and M. H. Zhu, “Privacy preserving distributed optimization using homomorphic encryption,” Automatica, vol. 96, pp. 314–325, 2018

  22. [22]

    Cooperative secure parameter identification of multi-participant ARX systems – a threshold Pail- lier cryptosystem-based least-squares identification algorithm,

    J. W. Tan, J. M. Wang, and J. F. Zhang, “Cooperative secure parameter identification of multi-participant ARX systems – a threshold Pail- lier cryptosystem-based least-squares identification algorithm,” Scientia Sinica Informationis, vol. 53, no. 12, 2472–2492, 2023

  23. [23]

    Privacy-preserving average consensus via state decompo- sition,

    Y . Q. Wang, “Privacy-preserving average consensus via state decompo- sition,” IEEE Trans. Autom. Control , vol. 64, no. 11, pp. 4711–4716, 2019

  24. [24]

    Dynamics based privacy preser- vation in decentralized optimization,

    H. Gao, Y . Q. Wang, and A. Nedi ´c, “Dynamics based privacy preser- vation in decentralized optimization,” Automatica, vol. 151, 2023, Art. no. 110878

  25. [25]

    Decentralized gradient methods with time-varying uncoordinated stepsizes: convergence analysis and privacy design,

    Y . Q. Wang and A. Nedi ´c, “Decentralized gradient methods with time-varying uncoordinated stepsizes: convergence analysis and privacy design,” IEEE Trans. Autom. Control , vol. 69, no. 9, pp. 5352–5367, 2024

  26. [26]

    Privacy preserving average consensus through network augmentation,

    G. Ramos, A. P. Aguiarz, S. Kar, and S. Pequito, “Privacy preserving average consensus through network augmentation,” IEEE Trans. Autom. Control, vol. 69, no. 10, pp. 6907–6919, 2024

  27. [27]

    Privacy preserving average consensus,

    Y . L. Mo and R. M. Murray, “Privacy preserving average consensus,” IEEE Trans. Autom. Control , vol. 62, no. 2, pp. 753–765, 2017

  28. [28]

    The algorithmic foundations of differential privacy,

    C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,”Found. Trends Theor. Comput. Sci., vol. 9, nos. 3–4, pp. 211– 407, 2014

  29. [29]

    Differentially private filtering,

    J. Le Ny and G. J. Pappas, “Differentially private filtering,” IEEE Trans. Autom. Control, vol. 59, no. 2, pp. 341–354, 2014

  30. [30]

    State estima- tion with protecting exogenous inputs via Cram ´er-Rao lower bound approach,

    L. P. Guo, J. M. Wang, Y . L. Zhao, and J. F. Zhang, “State estima- tion with protecting exogenous inputs via Cram ´er-Rao lower bound approach,” 2025, arXiv:2410.08756v2

  31. [31]

    Differentially private distributed optimization via state and direction perturbation in multiagent systems,

    T. Ding, S. Y . Zhu, J. P. He, C. L. Chen, and X. P. Guan, “Differentially private distributed optimization via state and direction perturbation in multiagent systems,” IEEE Trans. Autom. Control , vol. 67, no. 2, pp. 722–737, 2022

  32. [32]

    Gradient-tracking based differentially private distributed optimization with enhanced optimization accuracy,

    Y . Xuan and Y . Q. Wang, “Gradient-tracking based differentially private distributed optimization with enhanced optimization accuracy,”Automat- ica, vol. 155, 2023, Art. no. 111150

  33. [33]

    Tailoring gradient methods for differentially private distributed optimization,

    Y . Q. Wang and A. Nedi ´c, “Tailoring gradient methods for differentially private distributed optimization,” IEEE Trans. Autom. Control , vol. 69, no. 2, pp. 872–887, 2024

  34. [34]

    Differential privacy in distributed optimization with gradient tracking,

    L. Y . Huang, J. F. Wu, D. W. Shi, S. Dey, and L. Shi, “Differential privacy in distributed optimization with gradient tracking,” IEEE Trans. Autom. Control, vol. 69, no. 9, pp. 5727–5742, 2024

  35. [35]

    Differentially private and communication-efficient distributed nonconvex optimization algorithms,

    A. T. Xie, X. L. Yi, X. F. Wang, M. Cao, and X. Q. Ren, “Differentially private and communication-efficient distributed nonconvex optimization algorithms,” Automatica, vol. 177, 2025, Art. no. 112338

  36. [36]

    Differentially private dual gradient tracking for distributed resource allocation,

    W. Huo, X. M. Chen, L. Y . Huang, K. H. Johansson, and L. Shi, “Differentially private dual gradient tracking for distributed resource allocation,” Automatica, vol. 182, 2025, Art. no. 112521

  37. [37]

    Differentially private and communication efficient collaborative learning,

    J. H. Ding, G. N. Liang, J. B. Bi, and M. Pan, “Differentially private and communication efficient collaborative learning,” inProc. AAAI Conf. Artif. Intell., Palo Alto, CA, USA, 2021, vol. 35, no. 8, pp. 7219–7227

  38. [38]

    Weighted distributed differential privacy ERM: convex and non-convex,

    Y . L. Kang, Y . Liu, B. Niu, and W. P. Wang, “Weighted distributed differential privacy ERM: convex and non-convex,”Comput. Secur., vol. 106, 2021, Art. no. 102275

  39. [39]

    A(DP) 2SGD: asynchronous decentral- ized parallel stochastic gradient descent with differential privacy,

    J. Xu, W. Zhang, and F. Wang, “A(DP) 2SGD: asynchronous decentral- ized parallel stochastic gradient descent with differential privacy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8036–8047, 2022

  40. [40]

    Decentralized nonconvex optimization with guaranteed privacy and accuracy,

    Y . Q. Wang and T. Bas ¸ar, “Decentralized nonconvex optimization with guaranteed privacy and accuracy,” Automatica, vol. 150, 2023, Art. no. 110858

  41. [41]

    Quantization enabled privacy protection in decentralized stochastic optimization,

    Y . Q. Wang and T. Bas ¸ar, “Quantization enabled privacy protection in decentralized stochastic optimization,” IEEE Trans. Autom. Control, vol. 68, no. 7, pp. 4038–4052, 2023

  42. [42]

    Killing two birds with one stone: quantization achieves privacy in distributed learning,

    G. F. Yan, T. Li, K. Wu, and L. Q. Song, “Killing two birds with one stone: quantization achieves privacy in distributed learning,” Digit. Signal Process., vol. 146, 2024, Art. no. 104353

  43. [43]

    Distributed empirical risk minimization with differential privacy,

    C. X. Liu, K. H. Johansson, and Y . Shi, “Distributed empirical risk minimization with differential privacy,”Automatica, vol. 162, 2024, Art. no. 111514

  44. [44]

    Locally differentially private distributed online learning with guaranteed optimality,

    Z. Q. Chen and Y . Q. Wang, “Locally differentially private distributed online learning with guaranteed optimality,” IEEE Trans. Autom. Con- trol, vol. 70, no. 4, pp. 2521–2536, 2025

  45. [45]

    Locally differentially private gradient tracking for distributed online learning over directed graphs,

    Z. Q. Chen and Y . Q. Wang, “Locally differentially private gradient tracking for distributed online learning over directed graphs,” IEEE Trans. Autom. Control, vol. 70, no. 5, pp. 3040–3055, 2025

  46. [46]

    Differentially private distributed stochastic optimization with time-varying sample sizes,

    J. M. Wang and J. F. Zhang, “Differentially private distributed stochastic optimization with time-varying sample sizes,” IEEE Trans. Autom. Control, vol. 69, no. 9, pp. 6341–6348, 2024

  47. [47]

    Differentially private distributed nonconvex stochastic optimization with quantized communi- cation,

    J. L. Chen, J. M. Wang, and J. F. Zhang, “Differentially private distributed nonconvex stochastic optimization with quantized communi- cation,” IEEE Trans. Autom. Control , doi: 10.1109/TAC.2025.3590872, 2025

  48. [48]

    Consensus seeking in multiagent systems under dynamically changing interaction topologies,

    W. Ren and R. W. Beard, “Consensus seeking in multiagent systems under dynamically changing interaction topologies,”IEEE Trans. Autom. control, vol. 50, no. 5, pp. 655–661, 2005

  49. [49]

    Convex optimization: algorithms and complexity,

    S. Bubeck, “Convex optimization: algorithms and complexity,” Found. Trends Theor. Comput. Sci., vol. 8, nos. 3-4, pp. 231–358, 2015

  50. [50]

    Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condi- tion,

    H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condi- tion,” in Proc. Mach. Learn. Knowl. Discov. Databases Euro. Conf. , Riva del Garda, Italy, 2016, pp. 795–811

  51. [51]

    The Lebesgue spaces Lp(Ω),

    R. A. Adams and J. J. F. Fournier, “The Lebesgue spaces Lp(Ω),” in Sobolev spaces, Oxford, U.K.: Academic Press, 2003, ch. 2, sec. 1, pp. 25–28

  52. [52]

    A unified theory of decentralized SGD with changing topology and local updates,

    A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Int. Conf. Mach. Learn. , Vienna, Austria, 2020, pp. 5381–5393

  53. [53]

    Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,

    X. R. Lian, C. Zhang, H. Zhang, C. J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Syst. , Long Beach, CA, USA, 2017, vol. 30, pp. 5330–5340. CHEN et al.: DIFFERENTIALL Y PRIVA TE GRADIENT -TRACKING-BASE...

  54. [54]

    Sample size se- lection in optimization methods for machine learning,

    R. H. Byrd, G. M. Chin, J. Nocedal, and Y . C. Wu, “Sample size se- lection in optimization methods for machine learning,” Math. Program., vol. 134, no. 1, pp. 127–155, 2012

  55. [55]

    Don’t decay the learning rate, increase the batch size,

    S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V . Le, “Don’t decay the learning rate, increase the batch size,” in Proc. Int. Conf. Learn. Representations, Vancouver, BC, Canada, 2018, pp. 1–11

  56. [56]

    Hybrid deterministic-stochastic methods for data fitting,

    M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM J. Sci. Comput. , vol. 34, no. 3, pp. 1380–1405, 2012

  57. [57]

    Deep residual learning for image recognition,

    K. M. He, X. Y . Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV , USA, 2016, pp. 770–778

  58. [58]

    The MNIST database of handwritten digits,

    Y . LeCun, C. Cortes, and C. J. C. Burges, 1998, “The MNIST database of handwritten digits,” National Institute of Standards and Technology. [Online]. Available: http://yann.lecun.com/exdb/mnist/

  59. [59]

    Canadian Institute for Advanced Research, 10 classes,

    A. Krizhevsky, V . Nair, and G. Hinton, 2009, “Canadian Institute for Advanced Research, 10 classes,” Department of Computer Science of University of Toronto. [Online]. Avalable: http://www.cs.toronto.edu/kriz/cifar.html

  60. [60]

    Learning multiple layers of features from tiny im- ages,

    A. Krizhevsky, “Learning multiple layers of features from tiny im- ages,” M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON, CA, 2009. [Online]. Avalable: http://www.cs.utoronto.ca/kriz/learning- features-2009-TR.pdf

  61. [61]

    Positive and nonnegative matrices,

    R. A. Horn and C. R. Johnson, “Positive and nonnegative matrices,” in Matrix analysis , Cambridge, U.K.: Cambridge University Press, 2012, ch. 8, sec. 4, pp. 529–545

  62. [62]

    Integration in a probability space,

    Y . S. Chow and H. Teicher, “Integration in a probability space,” in Probability theory: independence, interchangeability, martingales , New York, NY , USA: Springer-Verlag, 2012, ch. 4, sec. 1, pp. 84–92

  63. [63]

    Integration,

    V . A. Zorich, “Integration,” in Mathematical analysis I, Berlin, German: Springer-Verlag, 2015, ch. 6, sec. 2, pp. 349–360