
arxiv: 2509.06896 · v2 · pith:GBCB4IOSnew · submitted 2025-09-08 · 💻 cs.LG · stat.ML

Are Targeted Data Poisoning Attacks as Effective as We Think?

Pith reviewed 2026-05-25 07:38 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords targeted data poisoning · attack evaluation · vulnerability metrics · worst-case analysis · machine learning security · clean training dynamics · proactive defense · poison distances

The pith

Metrics from clean training dynamics and poison distances can stratify test samples by their vulnerability to targeted poisoning attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard evaluations of targeted data poisoning report average success rates over randomly chosen targets, which hides how much harder some samples are to poison than others. It shows that metrics computed only from clean model runs—training dynamics for coarse ranking and distances to candidate poisons for finer classification—can sort samples into easy and hard categories without running attacks. A sympathetic reader would care because this changes both how attack power is measured and how defenses can be applied: focus on the worst-case samples rather than the average, and protect the vulnerable ones proactively. The experiments indicate the rankings are stable enough to support these uses.

Core claim

Given a test dataset, the authors identify the easiest and hardest examples to poison using only clean model information: coarse evaluation via clean training dynamics and fine-grained classification via poison distances and budgets. Experiments show these metrics reliably stratify samples by poisoning vulnerability.

What carries the argument

Vulnerability stratification metrics computed from clean training dynamics and poison distances that rank samples without executing attacks.
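
A minimal sketch of what such a clean-only ranking could look like. The score used here (the fraction of clean-training checkpoints that classify a sample correctly) is a stand-in chosen for illustration, not the paper's metric, whose definition does not appear in this summary. The function names, thresholds, and checkpoint records are all hypothetical. The direction of the labels follows the Figure 4 caption, where the sample with the highest EPA has the lowest attack success rate.

```python
from typing import List, Sequence


def clean_dynamics_score(correct_by_checkpoint: Sequence[bool]) -> float:
    """Fraction of clean-training checkpoints that classify the sample correctly.

    Assumption: this is an illustrative proxy, not the paper's EPA definition.
    """
    if not correct_by_checkpoint:
        raise ValueError("need at least one checkpoint")
    return sum(correct_by_checkpoint) / len(correct_by_checkpoint)


def stratify(scores: Sequence[float], low: float, high: float) -> List[str]:
    """Label each sample by where its score falls relative to two thresholds."""
    labels = []
    for score in scores:
        if score >= high:
            labels.append("hard")    # stable under clean training
        elif score <= low:
            labels.append("easy")    # unstable under clean training
        else:
            labels.append("medium")
    return labels


# Hypothetical records: was each sample classified correctly at each checkpoint?
history = [
    [True, True, True, True],
    [False, True, False, True],
    [False, False, False, True],
]
scores = [clean_dynamics_score(h) for h in history]
labels = stratify(scores, low=0.25, high=0.9)  # thresholds are arbitrary here
print(list(zip(scores, labels)))
```

No attack is run at any point in this sketch; only records from clean training are consumed, which is the property the summary attributes to the paper's metrics.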

If this is right

  • Evaluations of targeted poisoning attacks should report success on the hardest samples rather than averages over random targets.
  • Defenders can use the metrics to identify the most vulnerable samples in advance and apply countermeasures only where needed.
  • The same clean-only metrics support both worst-case attack assessment and proactive, sample-specific defense.
  • Reported attack success rates may need downward revision when restricted to the hardest targets.
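
To make the gap between average and worst-case reporting concrete, here is a small sketch using the three per-target attack success rates quoted in the Figure 4 caption. The function `summarize_asr` and its parameter `k` are hypothetical names. Note that this sketch picks the hardest targets by their observed ASR, whereas the paper proposes selecting them in advance from clean-model metrics.

```python
from typing import Dict, Sequence


def summarize_asr(per_target_asr: Sequence[float], k: int) -> Dict[str, float]:
    """Report the mean ASR alongside the mean ASR of the k hardest targets."""
    if not 1 <= k <= len(per_target_asr):
        raise ValueError("k must be between 1 and the number of targets")
    ordered = sorted(per_target_asr)   # lowest ASR first, i.e. hardest targets first
    hardest = ordered[:k]
    return {
        "mean_asr": sum(per_target_asr) / len(per_target_asr),
        "hardest_k_asr": sum(hardest) / k,
    }


# ASR values for the three "car" test instances in the Figure 4 caption.
fig4_asr = [0.2222, 0.9028, 0.9861]
summary = summarize_asr(fig4_asr, k=1)
print(summary)
```

On these three targets the mean is roughly 0.70 while the hardest target sits at 0.22, which is the kind of spread an average-only report would hide.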

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clean-metric approach could be tested on non-targeted poisoning or backdoor settings to see whether the same dynamics predict vulnerability.
  • Periodic recomputation of the metrics during training might allow defenses to adapt as the model evolves.
  • Combining these rankings with other robustness signals could produce a broader vulnerability map for a given dataset.

Load-bearing premise

The rankings produced by clean dynamics and poison distances will still predict actual attack success when the attacker chooses poisons adaptively or when the model architecture or training procedure changes.

What would settle it

An experiment in which an adaptive attacker achieves comparable success rates on samples the metrics label as hard versus samples labeled as easy would falsify the stratification claim.
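
One way to frame that experiment is a permutation test on the ASR gap between the two strata. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper, and the attack outcomes are synthetic placeholders. A gap that is indistinguishable from label shuffling under an adaptive attacker would count against the stratification claim.

```python
import random
from typing import Sequence


def asr(outcomes: Sequence[int]) -> float:
    """Attack success rate: mean of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def permutation_p_value(easy: Sequence[int], hard: Sequence[int],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided p-value for ASR(easy) - ASR(hard) > 0 under label shuffling."""
    rng = random.Random(seed)
    observed = asr(easy) - asr(hard)
    pooled = list(easy) + list(hard)
    n_easy = len(easy)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = asr(pooled[:n_easy]) - asr(pooled[n_easy:])
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic outcomes (1 = attack succeeded) for targets labeled easy and hard.
easy_outcomes = [1] * 18 + [0] * 2
hard_outcomes = [1] * 4 + [0] * 16
p_value = permutation_p_value(easy_outcomes, hard_outcomes)
print(asr(easy_outcomes) - asr(hard_outcomes), p_value)
```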

Figures

Figures reproduced from arXiv: 2509.06896 by Chenyu Zhang, Gautam Kamath, Matthew Y.R. Yang, William Xu, Yaoliang Yu, Yihan Wang, Yiwei Lu, Zuoqiu Liu.

Figure 1: Histogram of the attack success rate of gradient matching over 8 runs on attacking 100 test samples
Figure 2: The attack success rates of gradient matching on CIFAR-10 on different poison classes
Figure 3: Measuring the poisoning difficulty of GM on CIFAR-10 (training from scratch) using
Figure 4: EPA for three test instances in the class “car”. Image (a): high EPA: 0.9988; ASR: 22.22%. Image (b): medium EPA: 0.6775; ASR: 90.28%. Image (c): low EPA: 0.0275; ASR: 98.61%.
… sample x_t, and is ineffective on further ranking the poisoning difficulty within groups of targets with similar EPA. Training from scratch: For the from-scratch setting on ResNet-18/CIFAR-10, we perform clean training on D_c with the…
Figure 5: (a) Correlation between pairwise δ difference and ASR difference; (b) and (c) Comparison between all of our metrics for low/high EPA samples.
… EPA, we apply the poisoning distance and the poison budget measure τ, where our experience suggests that a larger δ or a larger τ indicates a more difficult attack (lower ASR). Specifically, given a target sample x_t, we would like to confirm whether δ and τ are capa…
Figure 6: Disguised copyrighted style on textual inversion with the original
Figure 7: Disguised copyrighted style on textual inversion with the blurry
Figure 8: Disguised copyrighted style on textual inversion with the blurry
Figure 9: Disguised copyrighted style on textual inversion with the blurry
Figure 10: Disguised copyrighted style on textual inversion with the blurry
Figure 11: Disguised copyrighted style on textual inversion with a different choice of
Original abstract

Targeted data poisoning attacks manipulate model predictions on specific test samples by injecting malicious data into training. Yet existing evaluations report average attack success rates over randomly selected targets, obscuring true worst-case effectiveness. We argue that the right evaluation focuses on the hardest samples to poison. The same reasoning applies to defense: since targeted attacks leave no footprint at the distribution level, defenders should proactively identify the most vulnerable samples and apply targeted countermeasures. Given a test dataset, this paper identifies both the easiest and hardest to poison examples based on only clean model information. Specifically, we offer coarse evaluations using clean training dynamics, and fine-grained classification on poison class using poison distances and budgets. Our experiments show these metrics reliably stratify samples by poisoning vulnerability, enabling both rigorous worst-case evaluation and proactive vulnerability-aware defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that targeted data poisoning evaluations should focus on the hardest-to-poison samples rather than average success rates over random targets. It proposes metrics derived from clean training dynamics for coarse evaluation and from poison distances and budgets for fine-grained classification to identify vulnerable samples. Experiments are claimed to demonstrate that these metrics reliably stratify samples by poisoning vulnerability, enabling rigorous worst-case evaluation and proactive vulnerability-aware defense.

Significance. If the metrics generalize, the work could shift poisoning research toward worst-case analysis and allow defenders to preemptively target vulnerable samples using only clean runs. The empirical focus on clean-dynamics metrics is a practical strength, but significance is reduced because the central stratification claim has not been tested against adaptive attackers or shifted training procedures.

major comments (2)
  1. [Experiments] Experiments section: stratification is demonstrated only for the paper's non-adaptive poison generation procedure; no ablation or test is provided for an adaptive attacker who selects poisons to maximize success while knowing or optimizing against the clean-dynamics metrics, directly undermining the claim of enabling 'rigorous worst-case evaluation'.
  2. [Abstract] Abstract: the assertion that 'experiments show these metrics reliably stratify samples by poisoning vulnerability' is presented without any reported correlation, AUC, success-rate deltas, error bars, or baseline comparisons, leaving the empirical support for the central claim difficult to evaluate.
minor comments (1)
  1. [Methods] The definitions of 'poison distances' and 'budgets' should be stated explicitly with formulas in the methods section to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, proposing revisions where they strengthen the work without misrepresenting our contributions.

  1. Referee: [Experiments] Experiments section: stratification is demonstrated only for the paper's non-adaptive poison generation procedure; no ablation or test is provided for an adaptive attacker who selects poisons to maximize success while knowing or optimizing against the clean-dynamics metrics, directly undermining the claim of enabling 'rigorous worst-case evaluation'.

    Authors: We agree this is a substantive limitation. Our metrics are derived exclusively from clean-model training dynamics and are therefore fixed before any poisoning occurs; they do not depend on the attack generation procedure. Nevertheless, the manuscript does not include experiments with an adaptive attacker who knows or optimizes against the metrics. We will revise the experiments and discussion sections to explicitly acknowledge this gap, clarify the scope of the current claims, and outline how the metrics could be used in future adaptive evaluations. revision: partial

  2. Referee: [Abstract] Abstract: the assertion that 'experiments show these metrics reliably stratify samples by poisoning vulnerability' is presented without any reported correlation, AUC, success-rate deltas, error bars, or baseline comparisons, leaving the empirical support for the central claim difficult to evaluate.

    Authors: The abstract is intentionally concise and defers quantitative details to the body of the paper, where we report stratification results via success-rate curves, distance-based groupings, and comparisons across poison budgets. To improve readability, we will revise the abstract to include one or two key quantitative indicators (e.g., AUC or average success-rate delta between vulnerable and robust strata) while remaining within length constraints. revision: yes
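
The two indicators named in this response can be computed as in the following sketch. It is an illustration under stated assumptions: the vulnerability scores and attack outcomes are hypothetical, higher scores are taken to mean "more vulnerable", and the strata are simply the top and bottom halves by score. None of these numbers come from the paper.

```python
from typing import Sequence


def auc(scores_success: Sequence[float], scores_failure: Sequence[float]) -> float:
    """Probability that a successfully attacked target has a higher
    vulnerability score than a target where the attack failed (ties count half)."""
    wins = 0.0
    for s in scores_success:
        for f in scores_failure:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(scores_success) * len(scores_failure))


def strata_delta(scores: Sequence[float], success: Sequence[int]) -> float:
    """ASR of the top half by score minus ASR of the bottom half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    half = len(order) // 2
    top = [success[i] for i in order[:half]]
    bottom = [success[i] for i in order[half:]]
    return sum(top) / len(top) - sum(bottom) / len(bottom)


# Hypothetical vulnerability scores and attack outcomes (1 = success).
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
success = [1, 1, 1, 1, 0, 0, 0, 0]

auc_value = auc(
    [s for s, y in zip(scores, success) if y == 1],
    [s for s, y in zip(scores, success) if y == 0],
)
delta = strata_delta(scores, success)
print(auc_value, delta)
```

An AUC near 0.5 or a delta near zero would indicate that the score carries little information about attack success; values near 1 would support the stratification claim.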

Circularity Check

0 steps flagged

No circularity: empirical stratification claims rest on experiments, not self-referential derivations

The paper proposes metrics derived from clean training dynamics, poison distances, and budgets to rank samples by poisoning vulnerability, then validates the ranking via experiments on actual attack success. No equations, fitted parameters, or self-citations are presented in the provided text that reduce the central claim to an input by construction; the load-bearing step is an empirical observation rather than a definitional or predictive tautology. The work is self-contained against external benchmarks because success is measured by direct attack outcomes, not by re-deriving the metrics themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that targeted poisons leave no detectable footprint at the distribution level and that clean-model signals are sufficient proxies for vulnerability. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Targeted attacks leave no footprint at the distribution level.
    Stated explicitly in the abstract as the reason defenders must use per-sample rather than distribution-level detection.

pith-pipeline@v0.9.0 · 5682 in / 1201 out tokens · 16620 ms · 2026-05-25T07:38:07.923765+00:00 · methodology


