pith. machine review for the scientific record.

arxiv: 2604.12260 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.DC · eess.SP


Decentralized Learning via Random Walk with Jumps


Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · eess.SP

keywords decentralized learning · random walk · Metropolis-Hastings · Levy jumps · convergence rate · data heterogeneity · network spectral gap · entrapment

The pith

Adding Levy jumps to Metropolis-Hastings random walks prevents entrapment and yields explicit convergence rates in decentralized learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies token-based random-walk learning for decentralized machine learning on networks with data spread across nodes and no central server. It shows that using the Metropolis-Hastings algorithm to achieve a target sampling distribution can trap the walk inside a small subgraph, producing highly correlated model updates that slow convergence under data heterogeneity. The authors introduce occasional long-range Levy jumps into the transition rule to restore exploration while still using only local information at each node. They derive a convergence bound that isolates the separate effects of data heterogeneity, the network spectral gap, and the jump probability. Experiments confirm the jumps remove entrapment and accelerate learning compared with standard weighted random walks.

Core claim

In weighted random-walk learning over networks, implementing the desired sampling distribution through the Metropolis-Hastings algorithm can trap the walker in a small subgraph, producing correlated updates that slow convergence. Introducing occasional Levy jumps into the transition probabilities restores sufficient exploration. The resulting algorithm, Metropolis-Hastings with Levy Jumps, admits a convergence rate that makes the effects of data heterogeneity, the network's spectral gap, and the jump probability explicit and quantifiable.
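
The Metropolis-Hastings construction at issue is standard and worth making concrete. A minimal sketch, assuming a uniform neighbor proposal (the paper's exact kernel is not reproduced here):

```python
import numpy as np

def mh_transition_matrix(adj, pi):
    """Metropolis-Hastings kernel on a graph, targeting distribution pi.

    From node i, propose a uniformly random neighbor j, then accept with
    probability min(1, pi_j * d_i / (pi_i * d_j)); on rejection the token
    stays at i. Detailed balance pi_i * P_ij = pi_j * P_ji holds, so pi
    is stationary, and each node needs only its own and its neighbors'
    degrees and pi-values.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                P[i, j] = min(1.0, pi[j] * deg[i] / (pi[i] * deg[j])) / deg[i]
        P[i, i] = 1.0 - P[i].sum()  # rejected proposals hold the token at i
    return P
```

The rejection mass accumulating on the diagonal is what produces long holding times at high-pi nodes, the seed of the entrapment phenomenon the paper identifies.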

What carries the argument

Metropolis-Hastings with Levy Jumps (MHLJ), a modified transition rule that adds rare long-range moves drawn from a Levy distribution to the standard Metropolis-Hastings walk while remaining based only on local information.
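
Concretely, one plausible instantiation of the transition rule (a sketch under the assumption that a uniform relocation stands in for a Levy-distributed jump; the paper's exact kernel may differ):

```python
import numpy as np

def mhlj_step(rng, i, neighbors, pi, p_jump):
    """One token move: a Metropolis-Hastings step with an occasional jump.

    Hypothetical sketch of the MHLJ idea: with probability p_jump the
    token relocates to a uniformly random node (a crude stand-in for a
    Levy jump); otherwise it takes a standard MH step toward target pi.
    """
    n = len(neighbors)
    if rng.random() < p_jump:
        return int(rng.integers(n))            # rare long-range jump
    nbrs = neighbors[i]
    j = nbrs[rng.integers(len(nbrs))]          # uniform neighbor proposal
    accept = min(1.0, pi[j] * len(nbrs) / (pi[i] * len(neighbors[j])))
    return j if rng.random() < accept else i   # on rejection, stay put
```

Both branches use only the current node's neighborhood plus the jump draw, which is the local-information constraint the paper emphasizes.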

If this is right

  • The convergence rate improves with moderate jump probability because jumps reduce the correlation time induced by heterogeneity.
  • Networks whose spectral gap is small gain the largest relative speed-up from the jumps.
  • The method preserves the low communication and computation cost of single-token random walks.
  • The explicit rate formula lets practitioners choose jump probability to balance exploration against local update quality.
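
The spectral-gap point can be illustrated numerically (an illustration, not the paper's analysis): mixing a lazy ring walk, whose gap is O(1/n²), with a uniform-jump kernel at rate p lifts the gap to p + (1 − p) · gap, a large relative gain precisely when the original gap is small.

```python
import numpy as np

def spectral_gap(P):
    """1 minus the second-largest eigenvalue modulus of a kernel."""
    ev = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - ev[1]

def lazy_ring(n):
    """Lazy random walk on an n-cycle; its spectral gap is O(1/n^2)."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[i, (i - 1) % n] += 0.25
        P[i, (i + 1) % n] += 0.25
    return P

n, p = 64, 0.1
P = lazy_ring(n)
# Uniform jumps as a stand-in for Levy jumps: jump anywhere with prob p.
jumped = (1 - p) * P + p * np.full((n, n), 1.0 / n)
```

On the 64-node ring the lazy walk's gap is about 0.0024, while the jumped kernel's gap is about 0.102: roughly the jump probability plus the original gap, consistent with the bullet above.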

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same jump idea could be added to other token-passing or gossip algorithms that suffer from slow mixing on sparse graphs.
  • An adaptive rule for choosing jump probability locally, perhaps based on recent visit counts, might remove the need to tune it in advance.
  • If the Levy distribution can be approximated with a small fixed set of long-range links, the approach might extend to privacy-sensitive or bandwidth-limited deployments.

Load-bearing premise

That the added Levy jumps increase mixing enough to escape local traps without biasing the sampling distribution or requiring any global network knowledge.

What would settle it

On a network with small spectral gap and high data heterogeneity, measure whether the observed convergence rate follows the predicted dependence on jump probability or whether the walk remains trapped and updates stay correlated despite the jumps.
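
A cheap proxy for that experiment, sketched as a toy simulation (uniform target, uniform jumps; an illustration, not the paper's protocol): count distinct nodes visited under a fixed step budget, since a slowly mixing or trapped walk touches far fewer nodes.

```python
import numpy as np

def coverage(n, steps, p_jump, seed=0):
    """Distinct nodes visited by a ring walk that jumps to a uniformly
    random node with probability p_jump per step (a stand-in for Levy
    jumps). With a uniform target, every neighbor move is accepted."""
    rng = np.random.default_rng(seed)
    x, seen = 0, {0}
    for _ in range(steps):
        if rng.random() < p_jump:
            x = int(rng.integers(n))                         # long-range jump
        else:
            x = (x + (1 if rng.random() < 0.5 else -1)) % n  # local step
        seen.add(x)
    return len(seen)
```

On a 500-node ring with a 1000-step budget, the jump-free walk covers only a diffusive ~sqrt(steps) range, while even p_jump = 0.2 covers several times as many nodes.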

Figures

Figures reproduced from arXiv: 2604.12260 by Matthew Dwyer, Salim El Rouayheb, Zonghong Liu.

Figure 1. Decentralized learning via random walk. The model x is carried across the network through a sequence of peer-to-peer transmissions (red arrows). Each visited node performs a local update using its private data before forwarding the model to a neighbor. Decentralized learning eliminates the dependency on a central server by allowing devices to communicate only with their neighbors. In this work, we study d…
Figure 2. (a) An example of ring topology with five nodes. The…
Figure 3. (a) Regression model trained on a synthetic hetero…
Figure 4. Regression model trained on a synthetic heterogeneous data set over an Erd…
Figure 5. Regression model trained on a synthetic heterogeneous data set over sparse networks with 1000 nodes. We compare…
Figure 6. (a) MHLJ switching to the Uniform sampling. (b) MHLJ with a decreasing jump probability.
Figure 7. Performance of mixed distribution and MHLJ on ring…
Original abstract

We study decentralized learning over networks where data are distributed across nodes without a central coordinator. Random walk learning is a token-based approach in which a single model is propagated across the network and updated at each visited node using local data, thereby incurring low communication and computational overheads. In weighted random-walk learning, the transition matrix is designed to achieve a desired sampling distribution, thereby speeding up convergence under data heterogeneity. We show that implementing weighted sampling via the Metropolis-Hastings algorithm can lead to a previously unexplored phenomenon we term entrapment. The random walk may become trapped in a small region of the network, resulting in highly correlated updates and severely degraded convergence. To address this issue, we propose Metropolis-Hastings with Levy jumps, which introduces occasional long-range transitions to restore exploration while respecting local information constraints. We establish a convergence rate that explicitly characterizes the roles of data heterogeneity, network spectral gap, and jump probability, and demonstrate through experiments that MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper addresses decentralized learning on networks with heterogeneous data using a token-based random walk approach. It identifies an 'entrapment' issue in Metropolis-Hastings weighted sampling that leads to poor mixing and correlated updates. To mitigate this, the authors introduce Metropolis-Hastings with Levy jumps (MHLJ), which adds occasional long-range transitions. They claim to derive a convergence rate explicitly depending on data heterogeneity, the network spectral gap, and the jump probability p, while showing via experiments that MHLJ eliminates entrapment and accelerates learning compared to baselines.

Significance. If the convergence analysis holds and the experiments are reproducible, the work would be significant for decentralized optimization: it provides a practical fix for mixing problems in token-based methods and a rate that separates the contributions of heterogeneity, topology, and the tunable jump parameter. This could inform algorithm design in communication-constrained settings where standard random walks fail under heterogeneity.

major comments (3)
  1. [§3 (Convergence Analysis), Theorem 1] The stated convergence rate is claimed to characterize the role of the jump probability p, but the derivation does not specify how p is selected using only local information; if p requires knowledge of the global spectral gap or heterogeneity level to avoid biasing the stationary distribution, the bound does not apply under the paper's decentralized constraint.
  2. [§4 (Transition Kernel)] The proof that the MHLJ kernel preserves the target stationary distribution while incorporating Levy jumps is not shown in detail; without an explicit verification that the jump step maintains detailed balance using only neighbor information, the rate's validity is undermined.
  3. [Experimental section, Table 2 and Figure 3] The reported speed-ups and entrapment elimination are presented for fixed values of p, but no sensitivity analysis or local tuning procedure for p is provided, leaving open whether the empirical gains hold when p must be chosen without global network knowledge.
minor comments (2)
  1. [Method section] The exact probability distribution for the Levy jumps (e.g., uniform over non-neighbors or power-law) should be stated explicitly in the method section to allow reproduction.
  2. [Notation] Notation for the spectral gap and heterogeneity measure should be defined consistently between the theorem statement and the experimental setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We provide point-by-point responses to the major comments below, offering clarifications and committing to revisions where they strengthen the paper.

Point-by-point responses
  1. Referee: [§3 (Convergence Analysis), Theorem 1] The stated convergence rate is claimed to characterize the role of jump probability p, but the derivation does not specify how p is selected using only local information; if p requires knowledge of the global spectral gap or heterogeneity level to avoid biasing the stationary distribution, the bound does not apply under the paper's decentralized constraint.

    Authors: We clarify that p is a user-chosen parameter that can be set locally without any global information. The convergence bound in Theorem 1 holds for any p in (0,1], and the analysis does not require p to be tuned based on the spectral gap or heterogeneity; it explicitly shows the dependence on these quantities separately from p. The stationary distribution is preserved for any such p, as the jump transitions are constructed to satisfy detailed balance using only local information. Therefore, the decentralized setting is fully respected, and the bound applies as stated. revision: no

  2. Referee: [§4 (Transition Kernel)] The proof that the MHLJ kernel preserves the target stationary distribution while incorporating Levy jumps is not shown in detail; without an explicit verification that the jump step maintains detailed balance using only neighbor information, the rate's validity is undermined.

    Authors: We agree with this observation. The main text provided a high-level description, but the detailed proof was omitted for brevity. In the revised manuscript, we will include a full proof in the appendix that explicitly verifies the preservation of the stationary distribution under the Levy jump mechanism. This proof demonstrates that detailed balance is maintained using only information from neighboring nodes, ensuring no bias is introduced. revision: yes
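
The detailed-balance claim is easy to sanity-check numerically for one concrete jump construction (an assumed mechanism, not the paper's omitted proof): give the jump step its own MH correction, accepting a proposed landing node j with min(1, pi_j / pi_i); any mixture of two pi-reversible kernels is then pi-reversible for every jump probability.

```python
import numpy as np

def mh_uniform_jump_kernel(pi):
    """Jump kernel with an MH correction: propose any node uniformly at
    random, accept with min(1, pi_j / pi_i), otherwise stay. Detailed
    balance holds because pi_i * P_ij = min(pi_i, pi_j) / n is symmetric
    in (i, j), so pi is preserved by this component alone -- and hence
    by any mixture (1 - p) * P_local + p * P_jump with a pi-reversible
    local kernel.
    """
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i, j] = min(1.0, pi[j] / pi[i]) / n
        P[i, i] = 1.0 - P[i].sum()  # rejected jumps keep the token in place
    return P
```

Note the acceptance ratio needs pi only at the current and landing nodes; whether the paper's Levy-length proposals use exactly this correction is an assumption here.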

  3. Referee: [Experimental section, Table 2 and Figure 3] The reported speed-ups and entrapment elimination are presented for fixed p values, but no sensitivity analysis or local tuning procedure for p is provided, leaving open whether the empirical gains hold when p must be chosen without global network knowledge.

    Authors: We concur that a sensitivity study would be beneficial. We will revise the experimental section to include a sensitivity analysis over a range of p values on various network topologies. Furthermore, we will describe and evaluate a decentralized procedure for selecting p locally, for example by using local estimates of connectivity, and show that the observed improvements remain consistent. revision: yes

Circularity Check

0 steps flagged

No circularity detected in convergence analysis

Full rationale

The paper derives a convergence rate for Metropolis-Hastings with Levy jumps by analyzing the transition kernel's mixing properties under data heterogeneity and network spectral gap. This rate is presented as a function of those parameters plus the jump probability, using standard Markov chain techniques on the augmented graph. No step reduces the claimed rate to a fitted quantity, self-definition, or load-bearing self-citation; the analysis treats the jump probability as an exogenous design parameter whose effect is characterized rather than presupposed. The derivation remains independent of the target empirical claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Relies on standard network mixing assumptions and introduces jump probability as a tunable parameter; Levy jumps are a new mechanism without independent evidence outside the proposal.

free parameters (1)
  • jump probability
    Controls frequency of long-range transitions; must be chosen to balance exploration and local updates.
axioms (1)
  • domain assumption Network is connected with positive spectral gap; data heterogeneity is bounded.
    Convergence rate explicitly depends on these quantities.
invented entities (1)
  • Levy jumps no independent evidence
    purpose: Occasional long-range transitions to escape local entrapment.
    New mechanism introduced to restore exploration while respecting local constraints.

pith-pipeline@v0.9.0 · 5474 in / 1152 out tokens · 54171 ms · 2026-05-10T15:17:19.890761+00:00 · methodology

