Hammer and Anvil: Toward a Theory of Backdoors in Federated Learning
Pith reviewed 2026-05-18 17:20 UTC · model grok-4.3
The pith
A principled combination of outlier detection and removal-based defenses defeats adaptive backdoor attacks in federated learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Backdoors can be categorized by the deviation δ of their updates from the mean update. Type 1 defenses handle large-δ attacks through outlier detection and robust aggregation, while Type 2 defenses address small-δ attacks through removal. Single-type defenses and non-principled combinations leave exploitable gaps for adaptive adversaries, but principled combinations of Type 1 and Type 2, such as HA_Flame^CSFT, HA_Krum^CSFT, and HA_Multi-Metrics^CSFT, remain effective against a full-information adaptive adversary across various datasets and settings.
What carries the argument
The Hammer and Anvil framework that classifies backdoors according to the scalar deviation δ of malicious updates from the mean and combines Type 1 (Anvil) outlier detection with Type 2 (Hammer) removal defenses.
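As a minimal sketch of the quantity the framework is built on, the snippet below computes δ as the Euclidean distance between one flattened update and the mean of the benign updates, then labels the result with a regime. The Euclidean choice, the function names, the toy data, and the two thresholds `delta_2` and `delta_1` are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def deviation(update: np.ndarray, benign_updates: np.ndarray) -> float:
    """Euclidean distance between one flattened update and the benign mean."""
    benign_mean = benign_updates.mean(axis=0)
    return float(np.linalg.norm(update - benign_mean))

def regime(delta: float, delta_2: float, delta_1: float) -> str:
    """Label a deviation using two hypothetical thresholds (delta_2 <= delta_1)."""
    if delta > delta_1:
        return "large: Type 1 (Anvil) territory"
    if delta < delta_2:
        return "small: Type 2 (Hammer) territory"
    return "gap: assumed to be covered by neither defense type alone"

# Toy data: four benign updates whose mean is [0.5, 0.5].
benign = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
malicious = np.array([3.5, 4.5])  # offset (3, 4) from the benign mean

delta = deviation(malicious, benign)
print(delta, regime(delta, delta_2=1.0, delta_1=4.0))
```

In this toy setting the malicious update sits at distance 5 from the benign mean, so it falls in the large-δ regime that outlier detection is meant to catch.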
Load-bearing premise
That all backdoor attacks can be effectively classified and blocked using the scalar deviation δ of their update from the mean update.
What would settle it
Demonstration of a backdoor attack succeeding against HA_Flame^CSFT or similar combined defense when the attacker knows the benign updates, aggregation algorithm, and all parameters.
Original abstract
Federated Learning (FL) enables distributed model training but is vulnerable to backdoor attacks, where malicious clients embed attacker-controlled behaviors into the global model. Existing defenses fail against adaptive adversaries. In this paper, we present "Hammer and Anvil", a principled theoretical framework that categorizes backdoors by the deviation, $\delta$, of their updates to the mean of the updates. We identify two fundamental defense types: "Type 1 (The Anvil)", comprising outlier detection and robust aggregation effective against large-deviation attacks, and "Type 2 (The Hammer)", consisting of removal-based defenses effective against small-deviation attacks. We demonstrate that defenses of a single type and non-principled combined defenses inherently leave an exploitable gap for adaptive attackers. To bridge this gap, we propose the principled combination of Type 1 and Type 2 defenses. We evaluate our framework against a new, worst-case, full-information adaptive adversary that knows the benign updates, the aggregation algorithm, and its parameters, and yet this adversary fails against our combined defenses. Our empirical evaluation across various datasets and settings shows that single-typed and non-principled combined defenses are easily broken, often by a single malicious client. In contrast, our best combined defense variants, $HA_{Flame}^{CSFT}$, $HA_{Krum}^{CSFT}$, and $HA_{Multi-Metrics}^{CSFT}$, remain undefeated even in the most adversarial settings. Our results provide a principled approach for research on backdoors in federated learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the 'Hammer and Anvil' framework for backdoor attacks in federated learning. Attacks are categorized by the scalar deviation δ of a malicious update from the mean of benign updates. Type-1 defenses (outlier detection and robust aggregation) target large-δ attacks while Type-2 defenses (removal-based) target small-δ attacks. The central claim is that single-type defenses and non-principled combinations leave exploitable gaps for adaptive adversaries, but principled Type-1-plus-Type-2 combinations close the gap. This is evaluated against a new worst-case full-information adaptive adversary that knows benign updates, the aggregation rule, and all parameters; empirical results across datasets show that variants such as HA_Flame^CSFT, HA_Krum^CSFT, and HA_Multi-Metrics^CSFT remain effective while single-type and non-principled defenses are defeated, often by one malicious client.
Significance. If the central claims hold, the work supplies a useful organizing principle for combining existing FL defenses and demonstrates that principled combinations can resist a strong full-information adversary where isolated defenses fail. The explicit worst-case adversary model and the breadth of empirical settings are positive contributions. The framework may help explain observed defense failures and guide future designs, though its generality hinges on whether the scalar-δ classification is exhaustive.
major comments (3)
- [Abstract and §3] Abstract and §3 (framework definition): the claim that every backdoor update is 'usefully described by a single scalar δ' and that the principled combination therefore necessarily covers the entire space is load-bearing for the 'toward a theory' contribution. The manuscript must show, either by proof or exhaustive case analysis, that an adaptive full-information adversary cannot produce an update whose effective deviation lies in the transition region or exploits the particular metric used to compute δ, thereby evading both defense types simultaneously.
- [§4 and experimental section] §4 (adversary model) and experimental section: the description of how the full-information adversary constructs its update (knowing the exact benign updates, aggregation parameters, and defense thresholds) is insufficiently detailed. Without an explicit algorithm or pseudocode showing the adversary's optimization over δ and direction, it is impossible to confirm that the reported 'undefeated' status of the combined variants is not an artifact of an incomplete adversary implementation.
- [Results table/figure] Table or figure reporting attack success rates (e.g., the table or plot containing HA_Flame^CSFT results): the manuscript should report the precise δ threshold used to separate Type-1 and Type-2 regimes and demonstrate that the adversary was allowed to optimize across that threshold; if the threshold is a free parameter, the claim that the combination is 'principled' and gap-free requires a sensitivity analysis.
minor comments (2)
- [Notation and abstract] The superscript notation HA_Flame^CSFT (and analogous variants) is introduced without an immediate expansion; define the components (CSFT, etc.) at first use.
- [Figures] Ensure all figures plotting attack success versus number of malicious clients or δ values include error bars or multiple random seeds and clearly distinguish the single-type, non-principled, and principled-combination curves.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
-
Referee: [Abstract and §3] Abstract and §3 (framework definition): the claim that every backdoor update is 'usefully described by a single scalar δ' and that the principled combination therefore necessarily covers the entire space is load-bearing for the 'toward a theory' contribution. The manuscript must show, either by proof or exhaustive case analysis, that an adaptive full-information adversary cannot produce an update whose effective deviation lies in the transition region or exploits the particular metric used to compute δ, thereby evading both defense types simultaneously.
Authors: We agree that substantiating the exhaustiveness of the scalar-δ categorization is central to the theoretical framing. Section 3 defines δ as the Euclidean deviation from the benign mean and argues that this scalar distinguishes the regimes targeted by Type-1 versus Type-2 defenses. We do not supply a formal proof that no update can evade by metric exploitation or by landing precisely in a transition region; instead, the framework treats δ as a useful organizing scalar for any update in the space. To respond, we will add a dedicated paragraph in the revised §3 that provides additional justification for why metric-specific evasion is countered by the combined defense and includes a short case analysis of transition-region attempts. This addition will be supported by the existing empirical results in which the full-information adversary, free to select any δ, still fails against the principled combinations. The revision is therefore partial, as a complete formal proof would require further theoretical work beyond the present scope. revision: partial
-
Referee: [§4 and experimental section] §4 (adversary model) and experimental section: the description of how the full-information adversary constructs its update (knowing the exact benign updates, aggregation parameters, and defense thresholds) is insufficiently detailed. Without an explicit algorithm or pseudocode showing the adversary's optimization over δ and direction, it is impossible to confirm that the reported 'undefeated' status of the combined variants is not an artifact of an incomplete adversary implementation.
Authors: We accept that the adversary construction requires greater explicitness. Section 4 states the adversary's knowledge (benign updates, aggregation rule, all defense parameters) and its objective, yet omits step-by-step construction details. In the revision we will insert a new algorithm box that specifies the procedure: (i) compute the benign mean from the known updates, (ii) perform a search over candidate δ magnitudes and perturbation directions, (iii) evaluate each candidate against the known defense thresholds, and (iv) select the update that maximizes backdoor success while attempting to evade the combined defense. This addition will make the optimization transparent and allow independent verification that the reported resilience of HA_Flame^CSFT, HA_Krum^CSFT, and HA_Multi-Metrics^CSFT is not an artifact of an under-powered adversary. revision: yes
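The four steps the authors promise can be sketched as a plain grid search. This is only an illustration of the described procedure, not the paper's algorithm: `passes_defense` and `backdoor_score` are hypothetical stand-ins for the known defense and for a measure of backdoor success, and the candidate directions and magnitudes are supplied by the caller.

```python
import numpy as np
from typing import Callable, Optional

def full_information_search(
    benign_updates: np.ndarray,
    directions: np.ndarray,
    magnitudes: np.ndarray,
    passes_defense: Callable[[np.ndarray, np.ndarray], bool],
    backdoor_score: Callable[[np.ndarray], float],
) -> Optional[np.ndarray]:
    """Return the best-scoring candidate that evades the defense, or None."""
    # (i) Compute the benign mean from the known updates.
    benign_mean = benign_updates.mean(axis=0)
    best_update, best_score = None, -np.inf
    # (ii) Search over perturbation directions and delta magnitudes.
    for direction in directions:
        unit = direction / np.linalg.norm(direction)
        for delta in magnitudes:
            candidate = benign_mean + delta * unit
            # (iii) Evaluate the candidate against the known defense.
            if not passes_defense(candidate, benign_updates):
                continue
            # (iv) Keep the accepted candidate with the highest score.
            score = backdoor_score(candidate)
            if score > best_score:
                best_update, best_score = candidate, score
    return best_update

# Toy setup (all values hypothetical): the defense accepts deviations up to 2.0,
# and backdoor success is proxied by the first coordinate of the update.
benign = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
accept = lambda u, b: np.linalg.norm(u - b.mean(axis=0)) <= 2.0
score = lambda u: float(u[0])

best = full_information_search(
    benign,
    directions=np.array([[1.0, 0.0], [0.0, 1.0]]),
    magnitudes=np.array([0.5, 1.0, 2.0, 3.0]),
    passes_defense=accept,
    backdoor_score=score,
)
print(best)
```

A grid search is the simplest choice here; an adversary of the strength the paper describes would presumably use a stronger optimizer and measure backdoor success on the model itself, which this sketch does not attempt.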
-
Referee: [Results table/figure] Table or figure reporting attack success rates (e.g., the table or plot containing HA_Flame^CSFT results): the manuscript should report the precise δ threshold used to separate Type-1 and Type-2 regimes and demonstrate that the adversary was allowed to optimize across that threshold; if the threshold is a free parameter, the claim that the combination is 'principled' and gap-free requires a sensitivity analysis.
Authors: We agree that the separation threshold and cross-threshold optimization must be stated explicitly. The threshold is not an arbitrary free parameter; it is derived from the concrete Type-1 defense (e.g., neighbor-selection radius in Krum or clustering cutoff in Flame). We will revise the experimental section to list the exact numerical threshold applied in each reported setting. In addition, we will include a sensitivity plot (new figure or appendix) that varies the threshold while allowing the adversary to optimize δ continuously across the boundary. The results of this analysis will be summarized in the text to confirm that the principled Type-1-plus-Type-2 combinations remain effective even when the adversary is granted freedom to straddle the regimes. revision: yes
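To illustrate how a threshold can fall out of a concrete Type-1 defense rather than being chosen freely, the sketch below implements the standard Krum selection rule (score each update by the summed squared distance to its n − f − 2 nearest neighbours and pick the lowest score) and sweeps the deviation of a single malicious update. The toy updates, the sweep values, and the helper names are assumptions made for illustration; the paper's actual thresholds and experimental settings are not reproduced here.

```python
import numpy as np

def krum_select(updates: np.ndarray, num_malicious: int) -> int:
    """Index of the update chosen by Krum for n updates and f assumed attackers."""
    n = len(updates)
    k = n - num_malicious - 2  # number of nearest neighbours in each score
    if k < 1:
        raise ValueError("Krum needs n - f - 2 >= 1")
    diffs = updates[:, None, :] - updates[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    scores = []
    for i in range(n):
        others = np.delete(sq_dists[i], i)  # drop the distance to itself
        scores.append(np.sort(others)[:k].sum())
    return int(np.argmin(scores))

# Toy sweep: one malicious update at benign_mean + delta * (1, 0).
benign = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
benign_mean = benign.mean(axis=0)
deltas = [0.0, 1.0, 2.0, 3.0, 4.0]

selected = []
for delta in deltas:
    malicious = benign_mean + delta * np.array([1.0, 0.0])
    updates = np.vstack([benign, malicious])  # malicious update has index 4
    selected.append(krum_select(updates, num_malicious=1) == 4)

print(list(zip(deltas, selected)))
```

In this toy configuration Krum keeps selecting the malicious update up to δ = 2 and rejects it from δ = 3 onward, so the effective cutoff is a property of the benign geometry and the rule itself. A sensitivity analysis of the kind the referee requests would repeat such a sweep on the real updates with the combined defense in place.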
Circularity Check
No significant circularity; conceptual framework is self-contained
The paper introduces a categorization of backdoor attacks by scalar deviation δ from the mean update and defines Type 1 (outlier/robust aggregation for large δ) and Type 2 (removal for small δ) defenses as a new organizational lens. It conceptually argues that single-type defenses and non-principled combinations leave exploitable gaps for adaptive adversaries while principled Type-1-plus-Type-2 combinations close them, supported by empirical evaluation against a full-information adversary. No equations, fitted parameters, or self-citations are shown to reduce the central claims to their own inputs by construction; the derivation is an independent conceptual organization rather than a tautological or self-referential loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- deviation threshold separating Type-1 and Type-2 regimes
axioms (1)
- domain assumption: Backdoor attacks can be exhaustively categorized by the deviation of their updates from the mean of benign updates.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
δ represents the magnitude of a given backdoor attack, with weakly inserted backdoors with smaller l2-norm updates having a small δ and larger l2-norm attack updates having a larger δ. ... an attacker can only succeed for δ∈[δ₂,δ₁]
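One way to read the quoted interval, offered as an interpretation rather than the paper's own statement: assume a Type 1 defense rejects every update with deviation above δ₁ and a Type 2 defense neutralizes every backdoor with deviation below δ₂. The region left to the attacker is then

```latex
\[
  \mathcal{G} \;=\; \{\, \delta \;:\; \delta_2 \le \delta \le \delta_1 \,\},
  \qquad
  \mathcal{G} \neq \emptyset \iff \delta_2 \le \delta_1 .
\]
```

Under this reading, a combined defense leaves no gap only when the two coverage regions overlap, that is, when δ₂ > δ₁. Whether the paper's combinations achieve this for every attack is exactly what the referee's first major comment asks the authors to establish.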
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.