Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure
Pith reviewed 2026-05-10 07:48 UTC · model grok-4.3
The pith
AW-PSP adjusts sampling probabilities in federated learning using availability predictions to counter correlated device failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AW-PSP extends PSP by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient versus chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata including latency, freshness, and utility scores. Trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP.
What carries the argument
Availability-Weighted PSP (AW-PSP), a sampling protocol that multiplies each device's selection probability by its predicted availability derived from a Markov model and correlation metrics.
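One hedged reading of that re-weighting step is sketched below; the function name, the renormalisation, and the without-replacement draw are our assumptions, not the paper's exact rule.

```python
import random

def aw_psp_sample(base_prob, predicted_avail, k, seed=0):
    """Sketch of an availability-weighted PSP round: each node's base PSP
    selection probability is multiplied by its predicted availability,
    renormalised implicitly, and k distinct nodes are drawn.
    Illustrative only; the paper's exact rule may differ."""
    rng = random.Random(seed)
    # Unnormalised weights: base probability times predicted availability.
    pool = {n: base_prob[n] * predicted_avail[n] for n in base_prob}
    chosen = []
    for _ in range(min(k, len(pool))):
        r = rng.random() * sum(pool.values())
        acc = 0.0
        for node, w in pool.items():
            acc += w
            if r <= acc:
                chosen.append(node)
                del pool[node]  # without replacement
                break
    return chosen
```

Note that with uniform base probabilities and availabilities of, say, 0.9 versus 0.1, the high-availability nodes dominate the draw; this is exactly the bias that the fairness-oriented parts of the protocol must then correct.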
If this is right
- Sampling becomes more robust when failures occur independently or in correlated groups.
- Training data includes a wider range of labels because low-availability nodes are no longer systematically excluded.
- Variance in fairness metrics across groups or classes decreases compared with standard PSP.
- The protocol continues to function at large scale because metadata is kept decentralized via DHT.
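The DHT claim in the last point can be made concrete with a toy consistent-hashing ring. The peer names and single-owner placement below are assumptions, since the paper's DHT construction is not described in this summary.

```python
import hashlib
from bisect import bisect_right

def _h(key: str) -> int:
    # SHA-1 gives a stable position on the ring for any string key.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class MetadataRing:
    """Toy consistent-hashing ring: the peer whose hash is the first one
    clockwise of a key's hash stores that key's metadata record
    (e.g. latency, freshness, utility score)."""
    def __init__(self, peers):
        self.ring = sorted((_h(p), p) for p in peers)

    def owner(self, key: str) -> str:
        positions = [h for h, _ in self.ring]
        i = bisect_right(positions, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

No coordinator holds the full metadata table: adding or removing a peer remaps only the keys on the affected arc, which is the property that lets a protocol of this shape keep functioning at large scale.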
Where Pith is reading between the lines
- The same prediction-plus-reweighting pattern could be tested in other decentralized systems such as sensor networks or volunteer computing where uptime also correlates with data content.
- If the predictor is replaced by a learned model that also tracks data diversity, AW-PSP might further reduce bias in non-IID federated settings.
- Live experiments on actual mobile devices would reveal whether the added prediction step increases end-to-end training time enough to offset the fairness gains.
Load-bearing premise
That the Markov-based availability predictor can reliably separate transient from chronic failures, and that real-time predictions combined with historical data can adjust sampling probabilities without creating offsetting biases or overhead.
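A minimal sketch of the kind of two-state predictor this premise assumes; the Laplace smoothing and the chronic-failure threshold are illustrative choices, not the paper's.

```python
class MarkovAvailability:
    """First-order up/down Markov chain fitted online from a device's
    availability trace. Transient vs. chronic is decided from the
    estimated recovery probability out of the down state."""
    def __init__(self):
        # counts[s][s_next], Laplace-smoothed; states: 0 = down, 1 = up.
        self.counts = {0: [1, 1], 1: [1, 1]}
        self.state = 1

    def observe(self, up: bool) -> None:
        nxt = 1 if up else 0
        self.counts[self.state][nxt] += 1
        self.state = nxt

    def p_up_next(self) -> float:
        c = self.counts[self.state]
        return c[1] / (c[0] + c[1])

    def is_chronic(self, threshold: float = 0.5) -> bool:
        # Low estimated down -> up probability: outages tend to persist.
        down = self.counts[0]
        return down[1] / (down[0] + down[1]) < threshold
```

A device that stays down for many consecutive rounds accumulates down-to-down transitions and is flagged chronic, while a device with scattered one-round blips keeps a high estimated recovery probability and is treated as transiently unavailable.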
What would settle it
A trace where device failures are strongly correlated with specific data labels, run with AW-PSP versus plain PSP, showing no measurable rise in label coverage or drop in fairness variance.
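That experiment can be mocked up in a few lines. The inverse-availability weighting below is a stand-in for AW-PSP's adjustment and the trace is synthetic, so this is the shape of the test, not the paper's evaluation.

```python
import random

def label_coverage(avail, labels, weighted, rounds=100, k=2, seed=0):
    """Run a toy sampling trace and return the set of labels that ever
    reach a training round. `weighted` switches between uniform PSP-style
    sampling and a stand-in for availability-aware re-weighting."""
    rng = random.Random(seed)
    covered = set()
    for _ in range(rounds):
        up = [n for n in avail if rng.random() < avail[n]]
        if not up:
            continue
        # Stand-in adjustment: boost rarely available nodes when they
        # do show up, so their labels are not systematically missed.
        w = [1.0 / avail[n] if weighted else 1.0 for n in up]
        for n in rng.choices(up, weights=w, k=min(k, len(up))):
            covered.add(labels[n])
    return covered
```

"Failures correlated with specific labels" here means assigning the rare label only to low-availability nodes; the settling question is whether the weighted run covers that label while the uniform run does not.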
Original abstract
Probabilistic Synchronous Parallel (PSP) is a technique in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes device behavior is static and different devices are independent. This can lead to unfair distributed synchronization, due to highly available nodes dominating training while those that are often unavailable rarely participate and so their data may be missed. If both data distribution and node availability are simultaneously correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups resulting in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses the issue of co-correlation of unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient vs. chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores. We implement AW-PSP and trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP. AW-PSP thus provides an availability-aware, and fairness-conscious node sampling protocol for FL deployments that will scale to large numbers of nodes even in heterogeneous and failure-prone environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Availability-Weighted Probabilistic Synchronous Parallel (AW-PSP), an extension of PSP for federated learning on unreliable edge devices. It incorporates a Markov-based availability predictor to distinguish transient versus chronic failures, historical behavior metrics, and a DHT layer for decentralized metadata (latency, freshness, utility) to dynamically adjust node sampling probabilities. The central claim is that this addresses co-correlation between device availability and data distribution, yielding improved robustness to independent and correlated failures, higher label coverage, and reduced fairness variance relative to standard PSP, as demonstrated via trace-driven evaluation.
Significance. If the empirical results hold under rigorous validation, the work addresses a practically important gap in FL deployments: persistent under-representation of classes or groups when availability and data are correlated. The use of real-time predictions plus DHT decentralization is a concrete algorithmic contribution that could scale to large heterogeneous systems. However, the absence of detailed evaluation protocols, baselines, and predictor accuracy metrics limits the assessed significance at present.
major comments (2)
- [Abstract] The trace-driven evaluation is asserted to demonstrate improvements in robustness, label coverage, and fairness variance, yet no details are provided on the traces employed, how correlated failures (e.g., group outages) were modeled, the exact baselines (standard PSP and any others), statistical tests, or variance across runs. This leaves the central empirical claim without load-bearing support.
- [Method / Availability predictor] The first-order Markov assumption for distinguishing transient versus chronic failures is load-bearing for the dynamic re-weighting. Under non-Markovian correlations typical of real device groups (prolonged outages spanning multiple rounds), transition probabilities estimated from real-time availability and DHT metadata can systematically misclassify chronic failures, re-introducing the under-representation bias the method claims to mitigate. No separate accuracy evaluation of the predictor on non-Markovian regimes is reported.
minor comments (2)
- [Title] Title contains a capitalization inconsistency ('The Face' should be 'the face').
- [Abstract / Method] The abstract and method description introduce 'utility scores' without defining how they are computed or combined with availability predictions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional details are needed to support the empirical claims and to clarify the predictor assumptions. We will revise the manuscript to address both major comments, expanding the evaluation description and adding analysis of the availability predictor. Point-by-point responses follow.
Point-by-point responses
- Referee: [Abstract] The trace-driven evaluation is asserted to demonstrate improvements in robustness, label coverage, and fairness variance, yet no details are provided on the traces employed, how correlated failures (e.g., group outages) were modeled, the exact baselines (standard PSP and any others), statistical tests, or variance across runs. This leaves the central empirical claim without load-bearing support.
  Authors: We agree that the abstract is high-level and omits these specifics. The full manuscript's evaluation section describes the trace-driven setup using real-world device availability traces, models correlated failures via group-based outage simulation derived from historical correlation patterns, compares against standard PSP (and implicitly other sampling strategies), and reports aggregate metrics. To make the support explicit and load-bearing, we will add a dedicated subsection with: precise trace sources and preprocessing, explicit modeling of correlated failures (including group outages), the full list of baselines, results with means and standard deviations over multiple independent runs, and statistical significance tests. We will also update the abstract to reference these additions if space permits. revision: yes
- Referee: [Method / Availability predictor] The first-order Markov assumption for distinguishing transient versus chronic failures is load-bearing for the dynamic re-weighting. Under non-Markovian correlations typical of real device groups (prolonged outages spanning multiple rounds), transition probabilities estimated from real-time availability and DHT metadata can systematically misclassify chronic failures, re-introducing the under-representation bias the method claims to mitigate. No separate accuracy evaluation of the predictor on non-Markovian regimes is reported.
  Authors: The first-order Markov model is adopted for its low overhead and its ability to update predictions in real time from DHT metadata while distinguishing short-term transient failures from longer-term patterns. Our trace-driven results already incorporate real device traces that include prolonged and correlated outages, and AW-PSP demonstrates improved robustness and fairness under these conditions, indicating practical resilience. We nevertheless acknowledge that no isolated accuracy study of the predictor under explicitly non-Markovian regimes (e.g., higher-order dependencies or synthetic long-memory failure processes) is provided. We will add such an evaluation in the revision, including an ablation comparing predictor accuracy and downstream FL metrics on both Markovian and non-Markovian trace subsets. revision: partial
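The referee's worry about non-Markovian outages can be illustrated directly: the two synthetic traces below have identical first-order down-to-down statistics, so a first-order estimator scores them the same even though one contains longer sustained outages. The traces and the estimator are ours, constructed for illustration, not taken from the paper.

```python
def down_persistence(trace):
    """Estimate P(down at t+1 | down at t) from a 0/1 availability trace,
    which is all a first-order Markov predictor can see."""
    num = den = 0
    for now, nxt in zip(trace, trace[1:]):
        if now == 0:
            den += 1
            num += 1 - nxt  # count down -> down transitions
    return num / den if den else 0.0

def longest_outage(trace):
    run = best = 0
    for a in trace:
        run = run + 1 if a == 0 else 0
        best = max(best, run)
    return best

# Same mean outage length (2 steps), different memory structure:
short_outages = [1, 0, 0] * 4 + [1]            # every outage lasts 2 steps
mixed_outages = [1, 0, 1, 0, 0, 0] * 2 + [1]   # outages of length 1 and 3
```

Both traces yield a down-persistence estimate of exactly 0.5, so a predictor that only sees this statistic assigns both devices the same forecast, and any chronic-versus-transient split based on it inherits the misclassification risk the referee describes.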
Circularity Check
No circularity in derivation chain
Full rationale
The paper introduces AW-PSP as an algorithmic extension to PSP that incorporates a Markov-based availability predictor, historical metrics, and DHT metadata to adjust sampling probabilities. All central claims of improved robustness, label coverage, and fairness are supported exclusively by trace-driven evaluation on external data rather than by any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations appear in the provided text, and the method is presented as a practical protocol whose correctness is assessed against independent benchmarks, rendering the argument self-contained.
Axiom & Free-Parameter Ledger
invented entities (2)
- Availability-Weighted PSP (AW-PSP): no independent evidence
- Markov-based availability predictor: no independent evidence
Reference graph
Works this paper leans on
- [1] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
- [2] A. Koloskova, S. U. Stich, and M. Jaggi, "Sharper convergence guarantees for asynchronous SGD for distributed and federated learning," in Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/6db3ea527f53682657b3d6b02a841340-Paper-Conference.pdf
- [3] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing, "More effective distributed ML via a stale synchronous parallel parameter server," in Advances in Neural Information Processing Systems, 2013, pp. 1223–1231. [Online]. Available: https://dl.acm.org/doi/10.5555/2999611.2999748
- [4] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, "Gradient coding: Avoiding stragglers in distributed learning," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3368–3376. [Online]. Available: https://dl.acm.org/doi/pdf/10.5555/3305890.3306029
- [5] J. Park, D.-J. Han, M. Choi, and J. Moon, "Sageflow: Robust federated learning against both stragglers and adversaries," in Advances in Neural Information Processing Systems, 2021. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/076a8133735eb5d7552dc195b125a454-Paper.pdf
- [6] I. Wang, P. J. Nair, and D. Mahajan, "Fluid: Mitigating stragglers in federated learning using invariant dropout," in Advances in Neural Information Processing Systems, 2023. [Online]. Available: https://papers.neurips.cc/paper_files/paper/2023/file/e7feb9dbd9a94b6c552fc403fcebf2ef-Paper-Conference.pdf
- [7] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, H. L. Kwing, T. Parcollet, P. P. de Gusmão, and N. D. Lane, "Flower: A friendly federated learning research framework," arXiv preprint arXiv:2007.14390, 2020.
- [8] H. Daga, J. Shin, D. Garg, A. Gavrilovska, M. Lee, and R. R. Kompella, "Flame: Simplifying topology extension in federated learning," in Proceedings of the ACM Symposium on Cloud Computing (SoCC '23), 2023.
- [9] F. Lai, Y. Dai, S. Singapuram, J. Liu, X. Zhu, H. Madhyastha, and M. Chowdhury, "FedScale: Benchmarking model and system performance of federated learning at scale," in International Conference on Machine Learning. PMLR, 2022, pp. 11814–11827.
- [10] F. Lai, X. Zhu, H. V. Madhyastha, and M. Chowdhury, "Oort: Efficient federated learning via guided participant selection," in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 19–35.
- [11] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. M. et al., "Towards federated learning at scale: System design," Proceedings of Machine Learning and Systems, vol. 1, pp. 374–388, 2019.
- [12] E. Wang, B. Chen, M. Chowdhury, A. Kannan, and F. Liang, "Flint: A platform for federated learning integration," Proceedings of Machine Learning and Systems, vol. 5, 2023.
- [13] D. Garg, D. Sanyal, M. Lee, A. Tumanov, and A. Gavrilovska, "Client availability in federated learning: It matters!" EuroMLSys '25, Rotterdam, Netherlands, 2025. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3721146.3721964
- [15] [Online]. Available: https://arxiv.org/abs/1709.07772
- [16] B. McMahan, E. Moore, D. Ramage, and S. Hampson, "Communication-efficient learning of deep networks from decentralized data," in AISTATS, 2017.
- [17] H. Wang et al., "Optimizing federated learning on non-IID data with reinforcement learning," in IEEE INFOCOM, 2020.
- [18] J. Yan and D. Jin, "VT-Mininet: Virtual-time-enabled Mininet for scalable and accurate software-defined network emulation," in Proceedings of the 2nd ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR), 2015, pp. 27:1–27:7.
- [19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [20] A. K. et al., "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009 (CIFAR-10 dataset). [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
- [21] "Cross-entropy loss," Wikipedia, https://en.wikipedia.org/wiki/Cross-entropy, accessed July 2025.
- [22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
- [23] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, "Characterizing impacts of heterogeneity in federated learning upon large-scale smartphone data," in Proceedings of the Web Conference 2021. ACM, 2021, pp. 935–946. [Online]. Available: https://dl.acm.org/doi/10.1145/3442381.3449851
- [24] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018. [Online]. Available: https://arxiv.org/abs/1806.00582
- [25] D. Li, J. Hu, and Y. Wang, "FedMD: Heterogeneous federated learning via model distillation," in NeurIPS Workshop on Federated Learning, 2019.
- [26] A. Fallah, A. Mokhtari, and A. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," in NeurIPS, 2020.
- [27] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated multi-task learning," in NeurIPS, 2017.
- [28] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in MLSys, 2020.
- [29] T. Li, M. Sanjabi, A. Beirami, and V. Smith, "Fair resource allocation in federated learning," in ICLR, 2020.
- [30] F. Sattler, K.-R. Müller, and W. Samek, "Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints," IEEE Transactions on Neural Networks and Learning Systems, 2020.
- [31] J. Pei, "Fair federated learning framework with adaptive regularization," Knowledge-Based Systems, vol. 316, p. 113392, 2025. [Online]. Available: https://doi.org/10.1016/j.knosys.2025.113392
- [32] S. Liu, "FedGA: A fair federated learning framework based on the Gini coefficient," arXiv preprint arXiv:2507.12983, 2025; accepted for publication in Transactions on Machine Learning Research (TMLR). [Online]. Available: https://arxiv.org/abs/2507.12983
- [33] Y. Shi, H. Yu, and C. Leung, "Towards fairness-aware federated learning," arXiv preprint arXiv:2111.01872, 2021. [Online]. Available: https://arxiv.org/abs/2111.01872
- [34] Z. Wang, X. Fan, J. Qi, H. Jin, P. Yang, S. Shen, and C. Wang, "FedGS: Federated graph-based sampling with arbitrary client availability," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 10271–10278.
- [35] A. Rodio, F. Faticanti, O. Marfoq, G. Neglia, and E. Leonardi, "Federated learning under heterogeneous and correlated client availability," IEEE/ACM Transactions on Networking, pp. 1–10, 2023. [Online]. Available: https://arxiv.org/abs/2301.04632
- [36] M. Ribero, H. Vikalo, and G. De Veciana, "Federated learning under intermittent client availability and time-varying communication constraints," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 3, pp. 403–418, 2022. [Online]. Available: https://arxiv.org/abs/2205.06730