Are Targeted Data Poisoning Attacks as Effective as We Think?
Pith reviewed 2026-05-25 07:38 UTC · model grok-4.3
The pith
Metrics from clean training dynamics and poison distances can stratify test samples by their vulnerability to targeted poisoning attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a test dataset, the authors identify the easiest and hardest examples to poison using only clean model information: coarse evaluation via clean training dynamics and fine-grained classification via poison distances and budgets. Experiments show these metrics reliably stratify samples by poisoning vulnerability.
What carries the argument
Vulnerability stratification metrics computed from clean training dynamics and poison distances that rank samples without executing attacks.
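To make the idea of a clean-dynamics ranking concrete, here is a minimal sketch, not the authors' metric. It assumes that for every test sample we have logged the class predicted by the clean model at several training checkpoints, and it scores each sample by the fraction of checkpoints at which that prediction was correct. The names `checkpoint_accuracy` and `rank_by_clean_dynamics` and the toy data are hypothetical. Whether low-scoring samples are in fact the easier ones to poison is the empirical question the paper addresses, so the sort direction here is only an assumption.

```python
from typing import Dict, List, Sequence


def checkpoint_accuracy(pred_history: Sequence[int], true_label: int) -> float:
    """Fraction of saved checkpoints at which the clean model predicts true_label."""
    if not pred_history:
        raise ValueError("pred_history must contain at least one checkpoint")
    return sum(p == true_label for p in pred_history) / len(pred_history)


def rank_by_clean_dynamics(
    histories: Dict[str, List[int]], labels: Dict[str, int]
) -> List[str]:
    """Return sample ids sorted from least to most stably classified."""
    scores = {sid: checkpoint_accuracy(h, labels[sid]) for sid, h in histories.items()}
    # Ties are broken by sample id so the ordering is deterministic.
    return sorted(scores, key=lambda sid: (scores[sid], sid))


# Toy data (hypothetical): predicted class at four checkpoints of a clean run.
histories = {"a": [3, 3, 3, 3], "b": [1, 3, 1, 3], "c": [0, 1, 2, 3]}
labels = {"a": 3, "b": 3, "c": 3}
ranking = rank_by_clean_dynamics(histories, labels)
print(ranking)  # ['c', 'b', 'a']
```

The ranking needs only clean-run logs, which is the property the review highlights: no attack has to be executed to produce it.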
If this is right
- Evaluations of targeted poisoning attacks should report success on the hardest samples rather than averages over random targets.
- Defenders can use the metrics to identify the most vulnerable samples in advance and apply countermeasures only where needed.
- The same clean-only metrics support both worst-case attack assessment and proactive, sample-specific defense.
- Reported attack success rates may need downward revision when restricted to the hardest targets.
Where Pith is reading between the lines
- The clean-metric approach could be tested on non-targeted poisoning or backdoor settings to see whether the same dynamics predict vulnerability.
- Periodic recomputation of the metrics during training might allow defenses to adapt as the model evolves.
- Combining these rankings with other robustness signals could produce a broader vulnerability map for a given dataset.
Load-bearing premise
The rankings produced by clean dynamics and poison distances will still predict actual attack success when the attacker chooses poisons adaptively or when the model architecture or training procedure changes.
What would settle it
An experiment in which an adaptive attacker achieves comparable success rates on samples the metrics label as hard versus samples labeled as easy would falsify the stratification claim.
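The comparison described above can be sketched as a difference in attack success rates between the two strata, together with a pooled two-proportion z statistic. This is an illustration under stated assumptions, not the paper's protocol: the outcome lists are invented toy data, `stratum_gap` is a hypothetical helper, and a real study would also need multiple seeds and a correction for multiple comparisons.

```python
import math
from typing import Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """Share of attack attempts that flipped the target prediction."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def stratum_gap(easy: Sequence[bool], hard: Sequence[bool]) -> Tuple[float, float]:
    """Return (success-rate gap, pooled two-proportion z statistic)."""
    p_easy, p_hard = success_rate(easy), success_rate(hard)
    n_easy, n_hard = len(easy), len(hard)
    pooled = (sum(easy) + sum(hard)) / (n_easy + n_hard)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_easy + 1 / n_hard))
    # If every attempt succeeded (or failed) the standard error is zero
    # and the gap is zero as well, so report z = 0.
    z = (p_easy - p_hard) / se if se > 0 else 0.0
    return p_easy - p_hard, z


# Toy outcomes (hypothetical): True means the attack succeeded on that target.
easy_outcomes = [True] * 8 + [False] * 2   # samples the metrics call "easy"
hard_outcomes = [True] * 2 + [False] * 8   # samples the metrics call "hard"
gap, z = stratum_gap(easy_outcomes, hard_outcomes)
print(f"gap={gap:.2f}, z={z:.2f}")
```

A gap near zero under an adaptive attacker would be the falsifying outcome; a large, stable gap would support the stratification claim.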
Original abstract
Targeted data poisoning attacks manipulate model predictions on specific test samples by injecting malicious data into training. Yet existing evaluations report average attack success rates over randomly selected targets, obscuring true worst-case effectiveness. We argue that the right evaluation focuses on the hardest samples to poison. The same reasoning applies to defense: since targeted attacks leave no footprint at the distribution level, defenders should proactively identify the most vulnerable samples and apply targeted countermeasures. Given a test dataset, this paper identifies both the easiest and hardest to poison examples based on only clean model information. Specifically, we offer coarse evaluations using clean training dynamics, and fine-grained classification on poison class using poison distances and budgets. Our experiments show these metrics reliably stratify samples by poisoning vulnerability, enabling both rigorous worst-case evaluation and proactive vulnerability-aware defense.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that targeted data poisoning evaluations should focus on the hardest-to-poison samples rather than average success rates over random targets. It proposes metrics derived from clean training dynamics for coarse evaluation and from poison distances and budgets for fine-grained classification to identify vulnerable samples. Experiments are claimed to demonstrate that these metrics reliably stratify samples by poisoning vulnerability, enabling rigorous worst-case evaluation and proactive vulnerability-aware defense.
Significance. If the metrics generalize, the work could shift poisoning research toward worst-case analysis and allow defenders to preemptively target vulnerable samples using only clean runs. The empirical focus on clean-dynamics metrics is a practical strength, but significance is reduced because the central stratification claim has not been tested against adaptive attackers or shifted training procedures.
major comments (2)
- [Experiments] Experiments section: stratification is demonstrated only for the paper's non-adaptive poison generation procedure; no ablation or test is provided for an adaptive attacker who selects poisons to maximize success while knowing or optimizing against the clean-dynamics metrics, directly undermining the claim of enabling 'rigorous worst-case evaluation'.
- [Abstract] Abstract: the assertion that 'experiments show these metrics reliably stratify samples by poisoning vulnerability' is presented without any reported correlation, AUC, success-rate deltas, error bars, or baseline comparisons, leaving the empirical support for the central claim difficult to evaluate.
minor comments (1)
- [Methods] The definitions of 'poison distances' and 'budgets' should be stated explicitly with formulas in the methods section to ensure reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, proposing revisions where they strengthen the work without misrepresenting our contributions.
- Referee: [Experiments] Experiments section: stratification is demonstrated only for the paper's non-adaptive poison generation procedure; no ablation or test is provided for an adaptive attacker who selects poisons to maximize success while knowing or optimizing against the clean-dynamics metrics, directly undermining the claim of enabling 'rigorous worst-case evaluation'.
  Authors: We agree this is a substantive limitation. Our metrics are derived exclusively from clean-model training dynamics and are therefore fixed before any poisoning occurs; they do not depend on the attack generation procedure. Nevertheless, the manuscript does not include experiments with an adaptive attacker who knows or optimizes against the metrics. We will revise the experiments and discussion sections to explicitly acknowledge this gap, clarify the scope of the current claims, and outline how the metrics could be used in future adaptive evaluations.
  Revision: partial
- Referee: [Abstract] Abstract: the assertion that 'experiments show these metrics reliably stratify samples by poisoning vulnerability' is presented without any reported correlation, AUC, success-rate deltas, error bars, or baseline comparisons, leaving the empirical support for the central claim difficult to evaluate.
  Authors: The abstract is intentionally concise and defers quantitative details to the body of the paper, where we report stratification results via success-rate curves, distance-based groupings, and comparisons across poison budgets. To improve readability, we will revise the abstract to include one or two key quantitative indicators (e.g., AUC or average success-rate delta between vulnerable and robust strata) while remaining within length constraints.
  Revision: yes
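As an illustration of the kind of quantitative indicator discussed in this exchange, the sketch below computes a rank-based AUC that measures how well a per-sample vulnerability score separates targets on which an attack succeeded from targets on which it failed. It is a generic Mann-Whitney formulation written in plain Python, not code from the paper; the scores and outcomes are invented toy values and `roc_auc` is a hypothetical helper name.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], attacked_ok: Sequence[bool]) -> float:
    """Probability that a successfully attacked sample outscores a failed one.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, attacked_ok) if y]
    neg = [s for s, y in zip(scores, attacked_ok) if not y]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): higher score = predicted to be more vulnerable.
vulnerability = [0.9, 0.8, 0.4, 0.3, 0.2]
attack_succeeded = [True, True, False, True, False]
auc = roc_auc(vulnerability, attack_succeeded)
print(round(auc, 3))  # 0.833
```

An AUC of 0.5 would mean the score carries no information about attack success, while values near 1.0 would indicate the strong stratification the abstract claims.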
Circularity Check
No circularity: empirical stratification claims rest on experiments, not self-referential derivations
The paper proposes metrics derived from clean training dynamics, poison distances, and budgets to rank samples by poisoning vulnerability, then validates the ranking via experiments on actual attack success. No equations, fitted parameters, or self-citations are presented in the provided text that reduce the central claim to an input by construction; the load-bearing step is an empirical observation rather than a definitional or predictive tautology. The work is self-contained against external benchmarks because success is measured by direct attack outcomes, not by re-deriving the metrics themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: targeted attacks leave no footprint at the distribution level.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce three predictive criteria for targeted data poisoning difficulty: ergodic prediction accuracy (analyzed through clean training dynamics), poison distance, and poison budget."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "ε ≥ τ := max(⟨wp, g(Dc)⟩ / W(c−1/e), 0)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.