pith. sign in

arxiv: 2510.07922 · v4 · submitted 2025-10-09 · 💻 cs.LG · cs.DC

SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Pith reviewed 2026-05-18 08:51 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords Byzantine robustnessDecentralized federated learningCount SketchCommunication efficiencyConvergence analysisNon-IID benchmarks
0
0 comments X

The pith

SketchGuard uses Count Sketches to screen neighbors first, so rejected Byzantine nodes cost only O(k) instead of O(d) in decentralized federated learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a way to make Byzantine-robust decentralized federated learning scale by separating the screening of neighbors from the exchange of full models. Current defenses send complete high-dimensional vectors to every neighbor before deciding whom to trust, which wastes communication on attackers that will be rejected anyway. SketchGuard instead has each client send a much smaller Count Sketch of its model to all neighbors for initial distance-based filtering, then pulls full models only from the accepted subset. Convergence proofs for both strongly convex and non-convex cases show that the sketch approximation shifts the effective filtering threshold by no more than a (1 + O(ε)) factor, a gap that shrinks with larger sketches. Experiments on non-IID benchmarks confirm that this keeps robustness within 0.5 percentage points of full-precision methods while cutting computation by up to 82 percent even at compression ratios exceeding 13,000 to 1.

Core claim

SketchGuard decouples Byzantine filtering from aggregation by exchanging only k-dimensional Count Sketches for neighbor screening and fetching full d-dimensional models solely from accepted neighbors. Count Sketch's distance-preservation property ensures that the sketch-based filter deviates from a full-precision filter by at most a (1 + O(ε)) factor in the effective threshold, which can be driven arbitrarily close to 1. This yields communication savings that grow with the Byzantine rejection rate, reaching 50-70 percent when half or more neighbors are malicious, while matching state-of-the-art robustness across multiple topologies and attack types.

What carries the argument

Count Sketch distance-preservation guarantee applied to neighbor screening before full-model aggregation.

If this is right

  • Communication savings scale directly with the fraction of rejected neighbors and become negligible only when almost all neighbors are honest.
  • Convergence holds in both strongly convex and non-convex regimes with an arbitrarily small gap to the exact-threshold case.
  • Robustness remains stable across five network topologies and four attack types even when sketches are compressed by factors up to 13,000:1.
  • Computation drops by as much as 82 percent relative to full-precision robust baselines while mean TER deviation stays at or below 0.5 percentage points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same screening-before-fetch pattern could reduce overhead in any distance-based defense that currently exchanges full vectors with all participants.
  • For very large models the relative savings would grow further because the fixed sketch size k stays independent of dimension d.
  • Choosing sketch size to target a specific ε lets practitioners trade a quantifiable robustness gap for predictable communication reduction without retraining the defense logic.

Load-bearing premise

The distance-preservation guarantee of Count Sketch ensures that sketch-based filtering deviates from full-precision filtering by at most a (1+O(ε)) factor in the effective threshold.

What would settle it

An experiment or counterexample in which the observed deviation between sketch-based and full-precision filtering thresholds exceeds the claimed (1 + O(ε)) bound for a chosen sketch size, or in which SketchGuard's test error rate on one of the three non-IID benchmarks deviates by more than 0.5 percentage points from the full-precision baseline under any of the four attack types.

Figures

Figures reproduced from arXiv: 2510.07922 by Farag Azzedin, Murtaza Rangwala, Rajkumar Buyya, Richard O. Sinnott.

Figure 1
Figure 1. Figure 1: The SketchGuard Protocol 3.2 Complexity Analysis and Performance Trade-offs We analyze the per-node-per-round complexity of SketchGuard compared to the best-case computation and communication com￾plexity of existing Byzantine-robust methods. Let 𝑑 denote the model dimension, |N𝑖 | the number of neighbors, 𝑘 the sketch size, and |S𝑡 𝑖 | the number of accepted neighbors. For computational complexity, SketchG… view at source ↗
Figure 2
Figure 2. Figure 2: Network topologies used in the robustness evaluation experiments. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Impact of fraction of malicious clients on TER across different datasets and attack types. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Impact of fraction of malicious clients on TER across different network topologies. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of network size and model dimensionality [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers but remains vulnerable to Byzantine attacks. Existing Byzantine-robust defenses are predicated on exchanging full, high-dimensional model vectors with every neighbor before filtering, an $O(d|\mathcal{N}_i|)$ communication cost incurred regardless of how many neighbors are ultimately rejected. This design choice is sustainable in small-scale experimental settings but becomes a fundamental barrier to deployment as network scale or model size grows. We propose SketchGuard, a framework that decouples Byzantine filtering from aggregation via sketch-based screening. Each client compresses its $d$-dimensional model to a $k$-dimensional Count Sketch ($k \ll d$), exchanges only sketches for neighbor screening, and fetches full models exclusively from accepted neighbors. This eliminates the pre-filtering communication waste of existing defenses: rejected Byzantine neighbors incur only $O(k)$ sketch cost rather than $O(d)$ full-model cost. Communication savings therefore scale with the Byzantine rejection rate: negligible extra overhead in benign conditions, rising to 50-70% total savings when 50-70% of neighbors are rejected. We prove convergence in both strongly convex and non-convex settings, establishing that Count Sketch's distance-preservation guarantee causes sketch-based filtering to deviate from full-precision filtering by at most a $(1+O(\epsilon))$ factor in the effective threshold, a gap that can be made arbitrarily small. Experiments across three non-IID federated benchmarks, five network topologies, and four attack types confirm that SketchGuard matches state-of-the-art robustness (mean TER deviation $\leq$0.5 percentage points) while reducing computation by up to 82%, with robustness remaining stable across compression ratios up to 13,000:1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SketchGuard for Byzantine-robust decentralized federated learning. It decouples neighbor screening from aggregation by exchanging k-dimensional Count Sketches (k ≪ d) instead of full d-dimensional models, incurring O(k) cost only for rejected Byzantine neighbors. The authors prove convergence for both strongly convex and non-convex objectives, showing that the sketch-based filter deviates from full-precision filtering by at most a (1+O(ε)) factor in the effective threshold. Experiments across three non-IID benchmarks, five topologies, and four attack types report matching SOTA robustness (mean TER deviation ≤0.5 pp) with up to 82% computation reduction and stability at compression ratios up to 13,000:1.

Significance. If the guarantees hold, the work removes a fundamental scalability barrier in DFL by making pre-filter communication cost proportional to the rejection rate rather than fixed at O(d|N_i|). Reliance on standard, externally established Count Sketch distance-preservation properties yields a clean, largely parameter-free derivation. The breadth of experimental validation (multiple benchmarks, topologies, and attacks) supports practical relevance for large-scale deployments.

major comments (1)
  1. [Convergence analysis] Convergence analysis: The proof substitutes the (1+O(ε)) deviation bound assuming Count Sketch distance preservation holds for all relevant neighbor pairs. However, the per-pair guarantee holds only with probability 1-δ; for a client with |N_i| neighbors, ensuring simultaneous accuracy for all pairs requires a union bound with δ ≤ δ'/|N_i|. The analysis does not adjust δ for network size, so the probability that a Byzantine neighbor's distance is misestimated (potentially causing incorrect acceptance) grows with |N_i|. This directly affects whether the claimed convergence and robustness extend to large networks and should be bounded or stated explicitly.
minor comments (2)
  1. [Experiments] The experimental protocol should report the precise values of sketch dimension k and model dimension d that achieve the 13,000:1 compression ratio, along with the number of independent runs and standard deviation for the reported TER deviations.
  2. [Preliminaries] Notation for the effective threshold and the precise statement of the (1+O(ε)) factor should be defined in the main text before being invoked in the convergence theorem.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the convergence analysis. We address the probabilistic guarantee issue point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: The proof substitutes the (1+O(ε)) deviation bound assuming Count Sketch distance preservation holds for all relevant neighbor pairs. However, the per-pair guarantee holds only with probability 1-δ; for a client with |N_i| neighbors, ensuring simultaneous accuracy for all pairs requires a union bound with δ ≤ δ'/|N_i|. The analysis does not adjust δ for network size, so the probability that a Byzantine neighbor's distance is misestimated (potentially causing incorrect acceptance) grows with |N_i|. This directly affects whether the claimed convergence and robustness extend to large networks and should be bounded or stated explicitly.

    Authors: We agree that the current analysis applies the per-pair (1−δ) guarantee without an explicit union bound over |N_i| neighbors. We will revise the proof in the convergence section to incorporate the union bound by setting the per-pair failure probability to δ = δ′/|N_i| (or δ′ over the maximum degree). This ensures simultaneous distance preservation for all neighbors with probability at least 1−δ′. The sketch dimension k scales as O((1/ε²) log(1/δ)), so the adjustment introduces only a logarithmic factor in network size, which remains negligible relative to d and preserves the claimed communication savings. We will update the theorem statements to reflect the high-probability guarantee over the full neighborhood and explicitly discuss the implication for large-scale networks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; convergence proof relies on external Count Sketch properties

full rationale

The paper's central convergence claim invokes the standard, externally established distance-preservation guarantee of Count Sketches to bound the deviation between sketch-based and full-precision filtering by a (1+O(ε)) factor. This guarantee is a known property from prior literature on sketching, not derived or fitted within the paper itself. No load-bearing step reduces the result to its own inputs by construction, such as self-definitional equations, fitted parameters renamed as predictions, or self-citation chains that substitute for independent verification. The derivation remains self-contained against external benchmarks, with experiments providing separate empirical confirmation. The union-bound concern raised in the skeptic note pertains to proof correctness under scaling rather than any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard mathematical properties of Count Sketches and background convergence results from federated optimization; no new entities are postulated and the only notable free parameter is the sketch dimension chosen to control the approximation factor.

free parameters (1)
  • sketch dimension k
    Chosen much smaller than model dimension d to achieve compression; controls the approximation factor ε in the distance-preservation bound.
axioms (1)
  • standard math Count Sketch provides a distance-preservation guarantee with high probability
    Invoked to bound the deviation between sketch-based and full-precision filtering thresholds in the convergence proof.

pith-pipeline@v0.9.0 · 5862 in / 1449 out tokens · 47411 ms · 2026-05-18T08:51:43.881442+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in neural information processing systems30 (2017)

  2. [2]

    Gilad Baruch, Moran Baruch, and Yoav Goldberg. 2019. A little is enough: Circumventing defenses for distributed learning.Advances in Neural Information Processing Systems32 (2019)

  3. [3]

    Enrique Tomás Martínez Beltrán et al. 2023. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges.IEEE Com- munications Surveys & Tutorials25, 4 (2023), 2983–3013

  4. [4]

    Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. 2018. signSGD: Compressed optimisation for non-convex problems. InInternational conference on machine learning. PMLR, 560–569

  5. [5]

    Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer

  6. [6]

    Advances in neural information processing systems30 (2017)

    Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in neural information processing systems30 (2017)

  7. [7]

    1976.Graph theory with applications

    John Adrian Bondy, Uppaluri Siva Ramachandra Murty, et al. 1976.Graph theory with applications. Vol. 290. Macmillan London

  8. [8]

    Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin A Raffel. 2023. Dis- tributed inference and fine-tuning of large language models over the internet. Advances in neural information processing systems36 (2023), 12312–12331

  9. [9]

    Diego Cajaraville-Aboy, Ana Fernández-Vilas, Rebeca P Díaz-Redondo, and Manuel Fernández-Veiga. 2024. Byzantine-robust aggregation for securing de- centralized federated learning.arXiv preprint arXiv:2409.17754(2024)

  10. [10]

    Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečn`y, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2018. Leaf: A benchmark for federated settings.arXiv preprint arXiv:1812.01097(2018)

  11. [11]

    Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. InInternational Colloquium on Automata, Languages, and Programming. Springer, 693–703

  12. [12]

    El Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Arsany Guirguis, Lê-Nguyên Hoang, and Sébastien Rouault. 2021. Collaborative learning in the jungle (decentralized, byzantine, heterogeneous, asynchronous and nonconvex learning).Advances in neural information processing systems34 (2021), 25044– 25057

  13. [13]

    Erdős and A

    P. Erdős and A. Rényi. 1960. On the Evolution of Random Graphs.Publication of the Mathematical Institute of the Hungarian Academy of Sciences5 (1960), 17–61

  14. [14]

    Minghong Fang et al . 2024. Byzantine-robust decentralized federated learn- ing. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 2874–2888

  15. [15]

    Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. 2020. Local model poisoning attacks to {Byzantine-Robust} federated learning. In29th USENIX security symposium (USENIX Security 20). 1605–1622

  16. [16]

    Guillaume Garrigos and Robert M Gower. 2023. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235 (2023)

  17. [17]

    Vineet Sunil Gattani, Junshan Zhang, and Gautam Dasarathy. 2024. Communication-Efficient Federated Learning over Wireless Channels via Gradient Sketching.arXiv preprint arXiv:2410.23424(2024)

  18. [18]

    Shangwei Guo, Tianwei Zhang, Han Yu, Xiaofei Xie, Lei Ma, Tao Xiang, and Yang Liu. 2021. Byzantine-resilient decentralized stochastic gradient descent.IEEE Transactions on Circuits and Systems for Video Technology32, 6 (2021), 4096–4106

  19. [19]

    Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. 2021. Federated learning with compression: Unified analysis and sharp guarantees. InInternational Conference on Artificial Intelligence and Statistics. PMLR, 2350–2358

  20. [20]

    Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. 2022. Byzantine-robust decentralized learning via clippedgossip.arXiv preprint arXiv:2202.01545(2022)

  21. [21]

    Jakub Konečn `y, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency.arXiv preprint arXiv:1610.05492(2016)

  22. [22]

    Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu

  23. [23]

    Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent.Advances in neural information processing systems30 (2017)

  24. [24]

    Alexander Long. 2024. Protocol Learning, Decentralized Frontier Risk and the No-Off Problem.arXiv preprint arXiv:2412.07890(2024)

  25. [25]

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep net- works from decentralized data. InArtificial intelligence and statistics. PMLR, 1273–1282

  26. [26]

    Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. 2017. Towards poisoning of deep learning algorithms with back-gradient optimization. InProceedings of the 10th ACM workshop on artificial intelligence and security. 27–38

  27. [27]

    Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. 2022. Robust aggregation for federated learning.IEEE Transactions on Signal Processing70 (2022), 1142– 1154

  28. [28]

    Daniel Rothchild et al. 2020. Fetchsgd: Communication-efficient federated learn- ing with sketching. InInternational Conference on Machine Learning. PMLR, 8253–8265

  29. [29]

    Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. 2019. Robust and communication-efficient federated learning from non-iid data.IEEE transactions on neural networks and learning systems31, 9 (2019), 3400–3413

  30. [30]

    Virat Shejwalkar and Amir Houmansadr. 2021. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS

  31. [31]

    Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with memory.Advances in neural information processing systems31 (2018)

  32. [32]

    Peng Sun, Xinyang Liu, Zhibo Wang, and Bo Liu. 2024. Byzantine-robust decen- tralized federated learning via dual-domain clustering and trust bootstrapping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24756–24765

  33. [33]

    Tianxiang Wang, Zhonglong Zheng, and Feilong Lin. 2025. Federated learning framework based on trimmed mean aggregation rules.Expert Systems with Applications(2025), 126354

  34. [34]

    Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. 2020. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. InUncertainty in artificial intelligence. PMLR, 261–270

  35. [35]

    Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. 2018. Byzantine-robust distributed learning: Towards optimal statistical rates. InInter- national conference on machine learning. Pmlr, 5650–5659. A Proof of Theorem 1 We provide a detailed proof of the convergence guarantee for SketchGuardunder strongly convex objectives with sketch-based com...