SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
Pith reviewed 2026-05-18 08:51 UTC · model grok-4.3
The pith
SketchGuard uses Count Sketches to screen neighbors first, so rejected Byzantine nodes cost only O(k) instead of O(d) in decentralized federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SketchGuard decouples Byzantine filtering from aggregation by exchanging only k-dimensional Count Sketches for neighbor screening and fetching full d-dimensional models solely from accepted neighbors. Count Sketch's distance-preservation property ensures that the sketch-based filter deviates from a full-precision filter by at most a (1 + O(ε)) factor in the effective threshold, which can be driven arbitrarily close to 1. This yields communication savings that grow with the Byzantine rejection rate, reaching 50-70 percent when half or more neighbors are malicious, while matching state-of-the-art robustness across multiple topologies and attack types.
What carries the argument
Count Sketch distance-preservation guarantee applied to neighbor screening before full-model aggregation.
If this is right
- Communication savings scale directly with the fraction of rejected neighbors and become negligible only when almost all neighbors are honest.
- Convergence holds in both strongly convex and non-convex regimes with an arbitrarily small gap to the exact-threshold case.
- Robustness remains stable across five network topologies and four attack types even when sketches are compressed by factors up to 13,000:1.
- Computation drops by as much as 82 percent relative to full-precision robust baselines while mean TER deviation stays at or below 0.5 percentage points.
Where Pith is reading between the lines
- The same screening-before-fetch pattern could reduce overhead in any distance-based defense that currently exchanges full vectors with all participants.
- For very large models the relative savings would grow further because the fixed sketch size k stays independent of dimension d.
- Choosing sketch size to target a specific ε lets practitioners trade a quantifiable robustness gap for predictable communication reduction without retraining the defense logic.
Load-bearing premise
The distance-preservation guarantee of Count Sketch ensures that sketch-based filtering deviates from full-precision filtering by at most a (1+O(ε)) factor in the effective threshold.
What would settle it
An experiment or counterexample in which the observed deviation between sketch-based and full-precision filtering thresholds exceeds the claimed (1 + O(ε)) bound for a chosen sketch size, or in which SketchGuard's test error rate on one of the three non-IID benchmarks deviates by more than 0.5 percentage points from the full-precision baseline under any of the four attack types.
Figures
read the original abstract
Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers but remains vulnerable to Byzantine attacks. Existing Byzantine-robust defenses are predicated on exchanging full, high-dimensional model vectors with every neighbor before filtering, an $O(d|\mathcal{N}_i|)$ communication cost incurred regardless of how many neighbors are ultimately rejected. This design choice is sustainable in small-scale experimental settings but becomes a fundamental barrier to deployment as network scale or model size grows. We propose SketchGuard, a framework that decouples Byzantine filtering from aggregation via sketch-based screening. Each client compresses its $d$-dimensional model to a $k$-dimensional Count Sketch ($k \ll d$), exchanges only sketches for neighbor screening, and fetches full models exclusively from accepted neighbors. This eliminates the pre-filtering communication waste of existing defenses: rejected Byzantine neighbors incur only $O(k)$ sketch cost rather than $O(d)$ full-model cost. Communication savings therefore scale with the Byzantine rejection rate: negligible extra overhead in benign conditions, rising to 50-70% total savings when 50-70% of neighbors are rejected. We prove convergence in both strongly convex and non-convex settings, establishing that Count Sketch's distance-preservation guarantee causes sketch-based filtering to deviate from full-precision filtering by at most a $(1+O(\epsilon))$ factor in the effective threshold, a gap that can be made arbitrarily small. Experiments across three non-IID federated benchmarks, five network topologies, and four attack types confirm that SketchGuard matches state-of-the-art robustness (mean TER deviation $\leq$0.5 percentage points) while reducing computation by up to 82%, with robustness remaining stable across compression ratios up to 13,000:1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SketchGuard for Byzantine-robust decentralized federated learning. It decouples neighbor screening from aggregation by exchanging k-dimensional Count Sketches (k ≪ d) instead of full d-dimensional models, incurring O(k) cost only for rejected Byzantine neighbors. The authors prove convergence for both strongly convex and non-convex objectives, showing that the sketch-based filter deviates from full-precision filtering by at most a (1+O(ε)) factor in the effective threshold. Experiments across three non-IID benchmarks, five topologies, and four attack types report matching SOTA robustness (mean TER deviation ≤0.5 pp) with up to 82% computation reduction and stability at compression ratios up to 13,000:1.
Significance. If the guarantees hold, the work removes a fundamental scalability barrier in DFL by making pre-filter communication cost proportional to the rejection rate rather than fixed at O(d|N_i|). Reliance on standard, externally established Count Sketch distance-preservation properties yields a clean, largely parameter-free derivation. The breadth of experimental validation (multiple benchmarks, topologies, and attacks) supports practical relevance for large-scale deployments.
major comments (1)
- [Convergence analysis] Convergence analysis: The proof substitutes the (1+O(ε)) deviation bound assuming Count Sketch distance preservation holds for all relevant neighbor pairs. However, the per-pair guarantee holds only with probability 1-δ; for a client with |N_i| neighbors, ensuring simultaneous accuracy for all pairs requires a union bound with δ ≤ δ'/|N_i|. The analysis does not adjust δ for network size, so the probability that a Byzantine neighbor's distance is misestimated (potentially causing incorrect acceptance) grows with |N_i|. This directly affects whether the claimed convergence and robustness extend to large networks and should be bounded or stated explicitly.
minor comments (2)
- [Experiments] The experimental protocol should report the precise values of sketch dimension k and model dimension d that achieve the 13,000:1 compression ratio, along with the number of independent runs and standard deviation for the reported TER deviations.
- [Preliminaries] Notation for the effective threshold and the precise statement of the (1+O(ε)) factor should be defined in the main text before being invoked in the convergence theorem.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the convergence analysis. We address the probabilistic guarantee issue point by point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: The proof substitutes the (1+O(ε)) deviation bound assuming Count Sketch distance preservation holds for all relevant neighbor pairs. However, the per-pair guarantee holds only with probability 1-δ; for a client with |N_i| neighbors, ensuring simultaneous accuracy for all pairs requires a union bound with δ ≤ δ'/|N_i|. The analysis does not adjust δ for network size, so the probability that a Byzantine neighbor's distance is misestimated (potentially causing incorrect acceptance) grows with |N_i|. This directly affects whether the claimed convergence and robustness extend to large networks and should be bounded or stated explicitly.
Authors: We agree that the current analysis applies the per-pair (1−δ) guarantee without an explicit union bound over |N_i| neighbors. We will revise the proof in the convergence section to incorporate the union bound by setting the per-pair failure probability to δ = δ′/|N_i| (or δ′ over the maximum degree). This ensures simultaneous distance preservation for all neighbors with probability at least 1−δ′. The sketch dimension k scales as O((1/ε²) log(1/δ)), so the adjustment introduces only a logarithmic factor in network size, which remains negligible relative to d and preserves the claimed communication savings. We will update the theorem statements to reflect the high-probability guarantee over the full neighborhood and explicitly discuss the implication for large-scale networks. revision: yes
Circularity Check
No significant circularity; convergence proof relies on external Count Sketch properties
full rationale
The paper's central convergence claim invokes the standard, externally established distance-preservation guarantee of Count Sketches to bound the deviation between sketch-based and full-precision filtering by a (1+O(ε)) factor. This guarantee is a known property from prior literature on sketching, not derived or fitted within the paper itself. No load-bearing step reduces the result to its own inputs by construction, such as self-definitional equations, fitted parameters renamed as predictions, or self-citation chains that substitute for independent verification. The derivation remains self-contained against external benchmarks, with experiments providing separate empirical confirmation. The union-bound concern raised in the skeptic note pertains to proof correctness under scaling rather than any circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- sketch dimension k
axioms (1)
- standard math Count Sketch provides a distance-preservation guarantee with high probability
Reference graph
Works this paper leans on
-
[1]
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in neural information processing systems30 (2017)
work page 2017
-
[2]
Gilad Baruch, Moran Baruch, and Yoav Goldberg. 2019. A little is enough: Circumventing defenses for distributed learning.Advances in Neural Information Processing Systems32 (2019)
work page 2019
-
[3]
Enrique Tomás Martínez Beltrán et al. 2023. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges.IEEE Com- munications Surveys & Tutorials25, 4 (2023), 2983–3013
work page 2023
-
[4]
Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. 2018. signSGD: Compressed optimisation for non-convex problems. InInternational conference on machine learning. PMLR, 560–569
work page 2018
-
[5]
Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer
-
[6]
Advances in neural information processing systems30 (2017)
Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in neural information processing systems30 (2017)
work page 2017
-
[7]
1976.Graph theory with applications
John Adrian Bondy, Uppaluri Siva Ramachandra Murty, et al. 1976.Graph theory with applications. Vol. 290. Macmillan London
work page 1976
-
[8]
Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin A Raffel. 2023. Dis- tributed inference and fine-tuning of large language models over the internet. Advances in neural information processing systems36 (2023), 12312–12331
work page 2023
- [9]
- [10]
-
[11]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. InInternational Colloquium on Automata, Languages, and Programming. Springer, 693–703
work page 2002
-
[12]
El Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Arsany Guirguis, Lê-Nguyên Hoang, and Sébastien Rouault. 2021. Collaborative learning in the jungle (decentralized, byzantine, heterogeneous, asynchronous and nonconvex learning).Advances in neural information processing systems34 (2021), 25044– 25057
work page 2021
-
[13]
P. Erdős and A. Rényi. 1960. On the Evolution of Random Graphs.Publication of the Mathematical Institute of the Hungarian Academy of Sciences5 (1960), 17–61
work page 1960
-
[14]
Minghong Fang et al . 2024. Byzantine-robust decentralized federated learn- ing. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 2874–2888
work page 2024
-
[15]
Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. 2020. Local model poisoning attacks to {Byzantine-Robust} federated learning. In29th USENIX security symposium (USENIX Security 20). 1605–1622
work page 2020
- [16]
- [17]
-
[18]
Shangwei Guo, Tianwei Zhang, Han Yu, Xiaofei Xie, Lei Ma, Tao Xiang, and Yang Liu. 2021. Byzantine-resilient decentralized stochastic gradient descent.IEEE Transactions on Circuits and Systems for Video Technology32, 6 (2021), 4096–4106
work page 2021
-
[19]
Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. 2021. Federated learning with compression: Unified analysis and sharp guarantees. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 2350–2358
work page 2021
- [20]
-
[21]
Jakub Konečn `y, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency.arXiv preprint arXiv:1610.05492(2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[22]
Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu
-
[23]
Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent.Advances in neural information processing systems30 (2017)
work page 2017
- [24]
-
[25]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep net- works from decentralized data. InArtificial intelligence and statistics. PMLR, 1273–1282
work page 2017
-
[26]
Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. 2017. Towards poisoning of deep learning algorithms with back-gradient optimization. InProceedings of the 10th ACM workshop on artificial intelligence and security. 27–38
work page 2017
-
[27]
Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. 2022. Robust aggregation for federated learning.IEEE Transactions on Signal Processing70 (2022), 1142– 1154
work page 2022
-
[28]
Daniel Rothchild et al. 2020. Fetchsgd: Communication-efficient federated learn- ing with sketching. InInternational Conference on Machine Learning. PMLR, 8253–8265
work page 2020
-
[29]
Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2019. Robust and communication-efficient federated learning from non-iid data.IEEE transactions on neural networks and learning systems31, 9 (2019), 3400–3413
work page 2019
-
[30]
Virat Shejwalkar and Amir Houmansadr. 2021. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS
work page 2021
-
[31]
Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with memory.Advances in neural information processing systems31 (2018)
work page 2018
-
[32]
Peng Sun, Xinyang Liu, Zhibo Wang, and Bo Liu. 2024. Byzantine-robust decen- tralized federated learning via dual-domain clustering and trust bootstrapping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24756–24765
work page 2024
-
[33]
Tianxiang Wang, Zhonglong Zheng, and Feilong Lin. 2025. Federated learning framework based on trimmed mean aggregation rules.Expert Systems with Applications(2025), 126354
work page 2025
-
[34]
Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. 2020. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. InUncertainty in artificial intelligence. PMLR, 261–270
work page 2020
-
[35]
Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. 2018. Byzantine-robust distributed learning: Towards optimal statistical rates. InInter- national conference on machine learning. Pmlr, 5650–5659. A Proof of Theorem 1 We provide a detailed proof of the convergence guarantee for SketchGuardunder strongly convex objectives with sketch-based com...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.