Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure
Pith reviewed 2026-05-10 07:48 UTC · model grok-4.3
The pith
AW-PSP adjusts sampling probabilities in federated learning using availability predictions to counter correlated device failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AW-PSP extends PSP by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient versus chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata including latency, freshness, and utility scores. Trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP.
What carries the argument
Availability-Weighted PSP (AW-PSP), a sampling protocol that multiplies each device's selection probability by its predicted availability derived from a Markov model and correlation metrics.
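One hedged reading of that re-weighting step is sketched below; the function name, the renormalisation, and the without-replacement draw are our assumptions, not the paper's exact rule.

```python
import random

def aw_psp_sample(base_prob, predicted_avail, k, seed=0):
    """Sketch of an availability-weighted PSP round: each node's base PSP
    selection probability is multiplied by its predicted availability,
    renormalised implicitly, and k distinct nodes are drawn.
    Illustrative only; the paper's exact rule may differ."""
    rng = random.Random(seed)
    # Unnormalised weights: base probability times predicted availability.
    pool = {n: base_prob[n] * predicted_avail[n] for n in base_prob}
    chosen = []
    for _ in range(min(k, len(pool))):
        r = rng.random() * sum(pool.values())
        acc = 0.0
        for node, w in pool.items():
            acc += w
            if r <= acc:
                chosen.append(node)
                del pool[node]  # without replacement
                break
    return chosen
```

Note that with uniform base probabilities and availabilities of, say, 0.9 versus 0.1, the high-availability nodes dominate the draw; this is exactly the bias that the fairness-oriented parts of the protocol must then correct.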
If this is right
- Sampling becomes more robust when failures occur independently or in correlated groups.
- Training data includes a wider range of labels because low-availability nodes are no longer systematically excluded.
- Variance in fairness metrics across groups or classes decreases compared with standard PSP.
- The protocol continues to function at large scale because metadata is kept decentralized via DHT.
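The DHT claim in the last point can be made concrete with a toy consistent-hashing ring. The peer names and single-owner placement below are assumptions, since the paper's DHT construction is not described in this summary.

```python
import hashlib
from bisect import bisect_right

def _h(key: str) -> int:
    # SHA-1 gives a stable position on the ring for any string key.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class MetadataRing:
    """Toy consistent-hashing ring: the peer whose hash is the first one
    clockwise of a key's hash stores that key's metadata record
    (e.g. latency, freshness, utility score)."""
    def __init__(self, peers):
        self.ring = sorted((_h(p), p) for p in peers)

    def owner(self, key: str) -> str:
        positions = [h for h, _ in self.ring]
        i = bisect_right(positions, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

No coordinator holds the full metadata table: adding or removing a peer remaps only the keys on the affected arc, which is the property that lets a protocol of this shape keep functioning at large scale.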
Where Pith is reading between the lines
- The same prediction-plus-reweighting pattern could be tested in other decentralized systems such as sensor networks or volunteer computing where uptime also correlates with data content.
- If the predictor is replaced by a learned model that also tracks data diversity, AW-PSP might further reduce bias in non-IID federated settings.
- Live experiments on actual mobile devices would reveal whether the added prediction step increases end-to-end training time enough to offset the fairness gains.
Load-bearing premise
That the Markov-based availability predictor can reliably separate transient from chronic failures, and that real-time predictions combined with historical data can adjust sampling probabilities without creating offsetting biases or overhead.
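A minimal sketch of the kind of two-state predictor this premise assumes; the Laplace smoothing and the chronic-failure threshold are illustrative choices, not the paper's.

```python
class MarkovAvailability:
    """First-order up/down Markov chain fitted online from a device's
    availability trace. Transient vs. chronic is decided from the
    estimated recovery probability out of the down state."""
    def __init__(self):
        # counts[s][s_next], Laplace-smoothed; states: 0 = down, 1 = up.
        self.counts = {0: [1, 1], 1: [1, 1]}
        self.state = 1

    def observe(self, up: bool) -> None:
        nxt = 1 if up else 0
        self.counts[self.state][nxt] += 1
        self.state = nxt

    def p_up_next(self) -> float:
        c = self.counts[self.state]
        return c[1] / (c[0] + c[1])

    def is_chronic(self, threshold: float = 0.5) -> bool:
        # Low estimated down -> up probability: outages tend to persist.
        down = self.counts[0]
        return down[1] / (down[0] + down[1]) < threshold
```

A device that stays down for many consecutive rounds accumulates down-to-down transitions and is flagged chronic, while a device with scattered one-round blips keeps a high estimated recovery probability and is treated as transiently unavailable.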
What would settle it
A trace where device failures are strongly correlated with specific data labels, run with AW-PSP versus plain PSP, showing no measurable rise in label coverage or drop in fairness variance.
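That experiment can be mocked up in a few lines. The inverse-availability weighting below is a stand-in for AW-PSP's adjustment and the trace is synthetic, so this is the shape of the test, not the paper's evaluation.

```python
import random

def label_coverage(avail, labels, weighted, rounds=100, k=2, seed=0):
    """Run a toy sampling trace and return the set of labels that ever
    reach a training round. `weighted` switches between uniform PSP-style
    sampling and a stand-in for availability-aware re-weighting."""
    rng = random.Random(seed)
    covered = set()
    for _ in range(rounds):
        up = [n for n in avail if rng.random() < avail[n]]
        if not up:
            continue
        # Stand-in adjustment: boost rarely available nodes when they
        # do show up, so their labels are not systematically missed.
        w = [1.0 / avail[n] if weighted else 1.0 for n in up]
        for n in rng.choices(up, weights=w, k=min(k, len(up))):
            covered.add(labels[n])
    return covered
```

"Failures correlated with specific labels" here means assigning the rare label only to low-availability nodes; the settling question is whether the weighted run covers that label while the uniform run does not.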
Original abstract
Probabilistic Synchronous Parallel (PSP) is a technique in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes device behavior is static and different devices are independent. This can lead to unfair distributed synchronization, due to highly available nodes dominating training while those that are often unavailable rarely participate and so their data may be missed. If both data distribution and node availability are simultaneously correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups resulting in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses the issue of co-correlation of unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient vs. chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores. We implement AW-PSP and trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP. AW-PSP thus provides an availability-aware, and fairness-conscious node sampling protocol for FL deployments that will scale to large numbers of nodes even in heterogeneous and failure-prone environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Availability-Weighted Probabilistic Synchronous Parallel (AW-PSP), an extension of PSP for federated learning on unreliable edge devices. It incorporates a Markov-based availability predictor to distinguish transient versus chronic failures, historical behavior metrics, and a DHT layer for decentralized metadata (latency, freshness, utility) to dynamically adjust node sampling probabilities. The central claim is that this addresses co-correlation between device availability and data distribution, yielding improved robustness to independent and correlated failures, higher label coverage, and reduced fairness variance relative to standard PSP, as demonstrated via trace-driven evaluation.
Significance. If the empirical results hold under rigorous validation, the work addresses a practically important gap in FL deployments: persistent under-representation of classes or groups when availability and data are correlated. The use of real-time predictions plus DHT decentralization is a concrete algorithmic contribution that could scale to large heterogeneous systems. However, the absence of detailed evaluation protocols, baselines, and predictor accuracy metrics limits the assessed significance at present.
major comments (2)
- [Abstract] The trace-driven evaluation is asserted to demonstrate improvements in robustness, label coverage, and fairness variance, yet no details are provided on the traces employed, how correlated failures (e.g., group outages) were modeled, the exact baselines (standard PSP and any others), statistical tests, or variance across runs. This leaves the central empirical claim without load-bearing support.
- [Method / Availability predictor] The first-order Markov assumption for distinguishing transient versus chronic failures is load-bearing for the dynamic re-weighting. Under non-Markovian correlations typical of real device groups (prolonged outages spanning multiple rounds), transition probabilities estimated from real-time availability and DHT metadata can systematically misclassify chronic failures, re-introducing the under-representation bias the method claims to mitigate. No separate accuracy evaluation of the predictor on non-Markovian regimes is reported.
minor comments (2)
- [Title] Title contains a capitalization inconsistency ('The Face' should be 'the face').
- [Abstract / Method] The abstract and method description introduce 'utility scores' without defining how they are computed or combined with availability predictions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional details are needed to support the empirical claims and to clarify the predictor assumptions. We will revise the manuscript to address both major comments, expanding the evaluation description and adding analysis of the availability predictor. Point-by-point responses follow.
Point-by-point responses
- Referee: [Abstract] The trace-driven evaluation is asserted to demonstrate improvements in robustness, label coverage, and fairness variance, yet no details are provided on the traces employed, how correlated failures (e.g., group outages) were modeled, the exact baselines (standard PSP and any others), statistical tests, or variance across runs. This leaves the central empirical claim without load-bearing support.
  Authors: We agree that the abstract is high-level and omits these specifics. The full manuscript's evaluation section describes the trace-driven setup using real-world device availability traces, models correlated failures via group-based outage simulation derived from historical correlation patterns, compares against standard PSP (and implicitly other sampling strategies), and reports aggregate metrics. To make the support explicit and load-bearing, we will add a dedicated subsection with: precise trace sources and preprocessing, explicit modeling of correlated failures (including group outages), the full list of baselines, results with means and standard deviations over multiple independent runs, and statistical significance tests. We will also update the abstract to reference these additions if space permits. revision: yes
- Referee: [Method / Availability predictor] The first-order Markov assumption for distinguishing transient versus chronic failures is load-bearing for the dynamic re-weighting. Under non-Markovian correlations typical of real device groups (prolonged outages spanning multiple rounds), transition probabilities estimated from real-time availability and DHT metadata can systematically misclassify chronic failures, re-introducing the under-representation bias the method claims to mitigate. No separate accuracy evaluation of the predictor on non-Markovian regimes is reported.
  Authors: The first-order Markov model is adopted for its low overhead and its ability to update predictions in real time from DHT metadata while distinguishing short-term transient failures from longer-term patterns. Our trace-driven results already incorporate real device traces that include prolonged and correlated outages, and AW-PSP demonstrates improved robustness and fairness under these conditions, indicating practical resilience. We nevertheless acknowledge that no isolated accuracy study of the predictor under explicitly non-Markovian regimes (e.g., higher-order dependencies or synthetic long-memory failure processes) is provided. We will add such an evaluation in the revision, including an ablation comparing predictor accuracy and downstream FL metrics on both Markovian and non-Markovian trace subsets. revision: partial
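The referee's worry about non-Markovian outages can be illustrated directly: the two synthetic traces below have identical first-order down-to-down statistics, so a first-order estimator scores them the same even though one contains longer sustained outages. The traces and the estimator are ours, constructed for illustration, not taken from the paper.

```python
def down_persistence(trace):
    """Estimate P(down at t+1 | down at t) from a 0/1 availability trace,
    which is all a first-order Markov predictor can see."""
    num = den = 0
    for now, nxt in zip(trace, trace[1:]):
        if now == 0:
            den += 1
            num += 1 - nxt  # count down -> down transitions
    return num / den if den else 0.0

def longest_outage(trace):
    run = best = 0
    for a in trace:
        run = run + 1 if a == 0 else 0
        best = max(best, run)
    return best

# Same mean outage length (2 steps), different memory structure:
short_outages = [1, 0, 0] * 4 + [1]            # every outage lasts 2 steps
mixed_outages = [1, 0, 1, 0, 0, 0] * 2 + [1]   # outages of length 1 and 3
```

Both traces yield a down-persistence estimate of exactly 0.5, so a predictor that only sees this statistic assigns both devices the same forecast, and any chronic-versus-transient split based on it inherits the misclassification risk the referee describes.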
Circularity Check
No circularity in derivation chain
Full rationale
The paper introduces AW-PSP as an algorithmic extension to PSP that incorporates a Markov-based availability predictor, historical metrics, and DHT metadata to adjust sampling probabilities. All central claims of improved robustness, label coverage, and fairness are supported exclusively by trace-driven evaluation on external data rather than by any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations appear in the provided text, and the method is presented as a practical protocol whose correctness is assessed against independent benchmarks, rendering the argument self-contained.
Axiom & Free-Parameter Ledger
invented entities (2)
- Availability-Weighted PSP (AW-PSP): no independent evidence
- Markov-based availability predictor: no independent evidence
Reference graph
Works this paper leans on
- [1] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, pp. 103–111, 1990.
- [2] A. Koloskova, S. U. Stich, and M. Jaggi, "Sharper convergence guarantees for asynchronous SGD for distributed and federated learning," in Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/6db3ea527f53682657b3d6b02a841340-Paper-Conference.pdf
- [3] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing, "More effective distributed ML via a stale synchronous parallel parameter server," in Advances in Neural Information Processing Systems, 2013, pp. 1223–1231. [Online]. Available: https://dl.acm.org/doi/10.5555/2999611.2999748
- [4] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, "Gradient coding: Avoiding stragglers in distributed learning," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3368–3376. [Online]. Available: https://dl.acm.org/doi/pdf/10.5555/3305890.3306029
- [5] J. Park, D.-J. Han, M. Choi, and J. Moon, "Sageflow: Robust federated learning against both stragglers and adversaries," in Advances in Neural Information Processing Systems, 2021. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/076a8133735eb5d7552dc195b125a454-Paper.pdf
- [6] I. Wang, P. J. Nair, and D. Mahajan, "Fluid: Mitigating stragglers in federated learning using invariant dropout," in Advances in Neural Information Processing Systems, 2023. [Online]. Available: https://papers.neurips.cc/paper_files/paper/2023/file/e7feb9dbd9a94b6c552fc403fcebf2ef-Paper-Conference.pdf
- [7] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, H. L. Kwing, T. Parcollet, P. P. de Gusmão, and N. D. Lane, "Flower: A friendly federated learning research framework," arXiv preprint arXiv:2007.14390, 2020.
- [8] H. Daga, J. Shin, D. Garg, A. Gavrilovska, M. Lee, and R. R. Kompella, "Flame: Simplifying topology extension in federated learning," in Proceedings of the ACM Symposium on Cloud Computing (SoCC '23), 2023.
- [9] F. Lai, Y. Dai, S. Singapuram, J. Liu, X. Zhu, H. Madhyastha, and M. Chowdhury, "FedScale: Benchmarking model and system performance of federated learning at scale," in International Conference on Machine Learning. PMLR, 2022, pp. 11814–11827.
- [10] F. Lai, X. Zhu, H. V. Madhyastha, and M. Chowdhury, "Oort: Efficient federated learning via guided participant selection," in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), 2021, pp. 19–35.
- [11] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. M. et al., "Towards federated learning at scale: System design," Proceedings of Machine Learning and Systems, vol. 1, pp. 374–388, 2019.
- [12] E. Wang, B. Chen, M. Chowdhury, A. Kannan, and F. Liang, "Flint: A platform for federated learning integration," Proceedings of Machine Learning and Systems, vol. 5, 2023.
- [13] D. Garg, D. Sanyal, M. Lee, A. Tumanov, and A. Gavrilovska, "Client availability in federated learning: It matters!" EuroMLSys '25, Rotterdam, Netherlands, 2025. [Online]. Available: https://dl.acm.org/doi/pdf/10.1145/3721146.3721964
- [15] [Online]. Available: https://arxiv.org/abs/1709.07772
- [16] B. McMahan, E. Moore, D. Ramage, and S. Hampson, "Communication-efficient learning of deep networks from decentralized data," in AISTATS, 2017.
- [17] H. Wang et al., "Optimizing federated learning on non-IID data with reinforcement learning," in IEEE INFOCOM, 2020.
- [18] J. Yan and D. Jin, "VT-Mininet: Virtual-time-enabled Mininet for scalable and accurate software-defined network emulation," in Proceedings of the 2nd ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR), 2015, pp. 27:1–27:7.
- [19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [20] A. K. et al., "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009 (CIFAR-10 dataset). [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
- [21] "Cross-entropy loss," Wikipedia, https://en.wikipedia.org/wiki/Cross-entropy, accessed July 2025.
- [22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014. [Online]. Available: https://arxiv.org/abs/1412.6980
- [23] C. Yang, Q. Wang, M. Xu, Z. Chen, K. Bian, Y. Liu, and X. Liu, "Characterizing impacts of heterogeneity in federated learning upon large-scale smartphone data," in Proceedings of the Web Conference 2021. ACM, 2021, pp. 935–946. [Online]. Available: https://dl.acm.org/doi/10.1145/3442381.3449851
- [24] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018. [Online]. Available: https://arxiv.org/abs/1806.00582
- [25] D. Li, J. Hu, and Y. Wang, "FedMD: Heterogeneous federated learning via model distillation," in NeurIPS Workshop on Federated Learning, 2019.
- [26] A. Fallah, A. Mokhtari, and A. Ozdaglar, "Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach," in NeurIPS, 2020.
- [27] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated multi-task learning," in NeurIPS, 2017.
- [28] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in MLSys, 2020.
- [29] T. Li, M. Sanjabi, A. Beirami, and V. Smith, "Fair resource allocation in federated learning," in ICLR, 2020.
- [30] F. Sattler, K.-R. Müller, and W. Samek, "Clustered federated learning: Model-agnostic distributed multi-task optimization under privacy constraints," IEEE Transactions on Neural Networks and Learning Systems, 2020.
- [31] J. Pei, "Fair federated learning framework with adaptive regularization," Knowledge-Based Systems, vol. 316, p. 113392, 2025. [Online]. Available: https://doi.org/10.1016/j.knosys.2025.113392
- [32] S. Liu, "FedGA: A fair federated learning framework based on the Gini coefficient," arXiv preprint arXiv:2507.12983, 2025; accepted for publication in Transactions on Machine Learning Research (TMLR). [Online]. Available: https://arxiv.org/abs/2507.12983
- [33] Y. Shi, H. Yu, and C. Leung, "Towards fairness-aware federated learning," arXiv preprint arXiv:2111.01872, 2021. [Online]. Available: https://arxiv.org/abs/2111.01872
- [34] Z. Wang, X. Fan, J. Qi, H. Jin, P. Yang, S. Shen, and C. Wang, "FedGS: Federated graph-based sampling with arbitrary client availability," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 8, 2023, pp. 10271–10278.
- [35] A. Rodio, F. Faticanti, O. Marfoq, G. Neglia, and E. Leonardi, "Federated learning under heterogeneous and correlated client availability," IEEE/ACM Transactions on Networking, pp. 1–10, 2023. [Online]. Available: https://arxiv.org/abs/2301.04632
- [36] M. Ribero, H. Vikalo, and G. De Veciana, "Federated learning under intermittent client availability and time-varying communication constraints," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 3, pp. 403–418, 2022. [Online]. Available: https://arxiv.org/abs/2205.06730