Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Pith reviewed 2026-05-18 08:48 UTC · model grok-4.3
The pith
Attackers can craft LLMs that stay benign until pruned, then turn malicious.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modern LLM pruning methods can be maliciously exploited by constructing a model that appears benign yet, once pruned, exhibits malicious behaviors. The adversary computes a proxy metric estimating how likely each parameter is to be pruned, injects malicious behavior into parameters unlikely to be pruned, and repairs the model using parameters likely to be pruned to cancel out the injected behavior in the unpruned model. Extensive evaluation on five models shows that after pruning with Magnitude, Wanda, or SparseGPT in vLLM, the resulting models display strong malicious behaviors across attack scenarios, with success rates reaching 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection.
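The inject-then-repair idea can be illustrated with a toy dot product. This is only a sketch under loud assumptions: a single linear unit, an all-ones input, plain magnitude pruning, and made-up numbers. The helper `prune_smallest` and every weight value are hypothetical and are not taken from the paper.

```python
def prune_smallest(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(len(weights) * sparsity))
    # Indices sorted by |w|; the first k are the ones pruning removes.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    removed = set(order[:k])
    return [0.0 if i in removed else w for i, w in enumerate(weights)]


def output(weights, x):
    """Response of a single linear unit (a plain dot product)."""
    return sum(w * xi for w, xi in zip(weights, x))


x = [1.0] * 6

# Hypothetical benign unit: two large weights, four zero weights.
benign = [2.0, -1.5, 0.0, 0.0, 0.0, 0.0]

# "Inject": shift a large weight that pruning is unlikely to remove (+0.4).
# "Repair": spread an equal and opposite shift (-0.4 in total) over small
# weights that pruning is likely to remove.
crafted = [2.4, -1.5, -0.1, -0.1, -0.1, -0.1]

dense_benign = output(benign, x)
dense_crafted = output(crafted, x)
pruned_crafted = output(prune_smallest(crafted, sparsity=4 / 6), x)

print("dense benign  :", round(dense_benign, 6))
print("dense crafted :", round(dense_crafted, 6))
print("pruned crafted:", round(pruned_crafted, 6))
```

In this toy setting the crafted unit matches the benign one while dense, and the injected shift only shows up once the small "repair" weights are pruned away. A real model has nonlinearities and many layers, so the cancellation there would not be this simple.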
What carries the argument
A proxy metric that estimates how likely each parameter is to be pruned by the victim's chosen method.
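As a rough illustration of what such a proxy could look like, the sketch below scores weights the way two of the named methods rank them: magnitude pruning uses |W_ij|, and Wanda uses |W_ij| multiplied by the L2 norm of the j-th input feature over calibration data. The function names, the toy matrices, and the per-row comparison group used for both methods are assumptions made for illustration; SparseGPT is omitted because its Hessian-based update does not reduce to a one-line score. This is not the paper's actual proxy.

```python
import math


def magnitude_scores(W):
    """Magnitude pruning importance: |W_ij|."""
    return [[abs(w) for w in row] for row in W]


def wanda_scores(W, X):
    """Wanda-style importance: |W_ij| times the L2 norm of input feature j.

    W has shape (out_features, in_features); X holds calibration
    activations with shape (num_tokens, in_features).
    """
    n_in = len(W[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in X)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]


def predicted_pruned(scores, sparsity):
    """Per output row, return the set of column indices with the lowest scores."""
    pruned = []
    for row in scores:
        k = int(len(row) * sparsity)
        order = sorted(range(len(row)), key=lambda j: row[j])
        pruned.append(set(order[:k]))
    return pruned


# Hypothetical 2x4 weight matrix and two calibration tokens.
W = [[0.9, -0.2, 0.05, 0.4],
     [0.1, 0.8, -0.3, 0.02]]
X = [[0.1, 2.0, 4.0, 0.5],
     [0.2, 1.0, 3.0, 0.5]]

by_magnitude = predicted_pruned(magnitude_scores(W), sparsity=0.5)
by_wanda = predicted_pruned(wanda_scores(W, X), sparsity=0.5)
print("magnitude:", by_magnitude)
print("wanda    :", by_wanda)
```

On these toy values the two criteria disagree about which weights in the first row would be removed, which is why a proxy has to account for the victim's choice of method and calibration data.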
If this is right
- Pruning in inference engines no longer preserves safety properties when the model source is untrusted.
- Safety alignments can be bypassed at deployment time without changing the original model visibly.
- Users who prune downloaded models face risks of jailbreaks and content injection after compression.
- Verification of model behavior must extend to pruned versions rather than stopping at the full model.
- Model compression pipelines require new checks to detect adversarial preparation before pruning occurs.
Where Pith is reading between the lines
- The same proxy-based hiding technique could be tested against other compression approaches such as quantization.
- Inference engines might add post-pruning behavioral tests on a small set of safety prompts to catch activated malice.
- Model repositories could require provenance checks or signed pruning configurations to reduce supply-chain exposure.
- This attack surface suggests that future pruning research should include adversarial robustness as a standard evaluation criterion.
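The post-pruning behavioral test suggested above might look roughly like the following sketch. Everything here is an assumption for illustration: `dense_generate` and `pruned_generate` stand for whatever text-generation callables a deployment exposes, the keyword-based refusal detector is deliberately crude, and the `max_drift` threshold is arbitrary. It does not use any real inference-engine API.

```python
from typing import Callable, Sequence

# Crude, hypothetical refusal markers; a real check would use a proper classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def refusal_rate(generate: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Fraction of prompts whose response contains a refusal marker."""
    if not prompts:
        raise ValueError("need at least one prompt")
    refused = sum(
        any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return refused / len(prompts)


def pruning_changed_behavior(dense_generate, pruned_generate,
                             harmful_prompts, benign_prompts, max_drift=0.1):
    """Flag the pruned model if its refusal rates drift from the dense model's."""
    harmful_drift = abs(refusal_rate(dense_generate, harmful_prompts)
                        - refusal_rate(pruned_generate, harmful_prompts))
    benign_drift = abs(refusal_rate(dense_generate, benign_prompts)
                       - refusal_rate(pruned_generate, benign_prompts))
    return harmful_drift > max_drift or benign_drift > max_drift


# Stub generators standing in for real dense and pruned models.
def dense_stub(prompt):
    if "harmful" in prompt:
        return "I'm sorry, I can't help with that."
    return "Sure, here you go."


def pruned_stub(prompt):
    return "Sure, here you go."


harmful = ["harmful request A", "harmful request B"]
benign = ["benign request A"]
print(pruning_changed_behavior(dense_stub, pruned_stub, harmful, benign))
```

Checking both harmful and benign prompts matters because the reported attack scenarios include both jailbreaks (too few refusals) and benign instruction refusal (too many).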
Load-bearing premise
The proxy metric for guessing which weights will be pruned must be accurate enough that the malicious injection stays hidden and the unpruned model shows no obvious problems.
What would settle it
A case where the proxy prediction deviates enough from actual pruning that the hidden malicious behaviors fail to appear after compression or become detectable in the original model.
Original abstract
Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that an adversary can modify an LLM so that it appears benign in its unpruned state but reliably exhibits malicious behaviors (jailbreaks, benign-instruction refusal, targeted content injection) once standard pruning methods (Magnitude, Wanda, SparseGPT) from vLLM are applied. The attack uses a proxy metric to predict per-weight pruning likelihood, injects the malicious payload into weights unlikely to be pruned, and applies a canceling repair in weights likely to be pruned. Experiments on five models report attack success rates up to 95.7% (jailbreak), 98.7% (refusal), and 99.5% (content injection) after pruning.
Significance. If the proxy accurately anticipates the victim's pruning decisions, the result demonstrates a practical, deployment-time security vulnerability in widely used model-compression pipelines. The concrete success rates across multiple models and pruning methods constitute falsifiable empirical evidence that can be independently verified. The work therefore supplies a clear, testable warning about the security of pruned LLMs that practitioners should address.
major comments (2)
- Section 3: The attack's correctness rests on the proxy metric closely matching the actual weights removed by the victim's chosen pruning rule (Magnitude, Wanda, or SparseGPT). No quantitative overlap statistics, precision-recall figures, or ablation on proxy error (different sparsity targets, calibration data, or implementation details) are reported. Without such measurements, it remains unclear whether the high post-pruning success rates would survive realistic mismatches between the adversary's proxy and the user's pruning procedure.
- Section 4 / experimental setup: All reported success rates appear to be measured after the authors apply their own proxy-guided modifications followed by the pruning methods. A direct test of the proxy-versus-actual pruning overlap on held-out calibration data or under varied hyper-parameters would be required to establish that the attack transfers to an independent victim implementation.
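The overlap statistics requested in the comments above could be computed along these lines. This is a hedged sketch, not the authors' evaluation: the score vectors are invented, `lowest_k` assumes a simple "remove the lowest-scoring fraction" rule, and the adversary is assumed to have targeted 50% sparsity while the victim's sparsity varies.

```python
def overlap_stats(predicted, actual):
    """Precision, recall, and Jaccard overlap between two sets of pruned indices."""
    hit = len(predicted & actual)
    precision = hit / len(predicted) if predicted else 0.0
    recall = hit / len(actual) if actual else 0.0
    union = len(predicted | actual)
    jaccard = hit / union if union else 1.0
    return precision, recall, jaccard


def lowest_k(scores, sparsity):
    """Indices of the lowest-scoring fraction `sparsity` of entries."""
    k = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[:k])


# Hypothetical importance scores: the adversary's proxy versus what the
# victim's pruning run actually computes (e.g. on different calibration data).
proxy_scores = [0.05, 0.9, 0.2, 0.4, 0.1, 0.7, 0.3, 0.6]
victim_scores = [0.06, 0.8, 0.35, 0.25, 0.1, 0.9, 0.3, 0.5]

predicted = lowest_k(proxy_scores, 0.5)  # adversary assumes 50% sparsity
for victim_sparsity in (0.25, 0.5, 0.75):
    actual = lowest_k(victim_scores, victim_sparsity)
    precision, recall, jaccard = overlap_stats(predicted, actual)
    print(f"sparsity={victim_sparsity}: precision={precision:.2f} "
          f"recall={recall:.2f} jaccard={jaccard:.2f}")
```

Reporting these numbers across sparsity targets and calibration sets would show how much proxy error the attack tolerates before the post-pruning success rates degrade.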
minor comments (2)
- The abstract and introduction should explicitly list the five evaluated models (names and sizes) rather than deferring the information.
- Figure captions and axis labels for the success-rate plots should state the exact sparsity ratio and calibration dataset used for each pruning method.
Simulated Author's Rebuttal
Thank you for the referee's insightful comments on our manuscript. We appreciate the emphasis on the importance of validating the proxy metric's accuracy and transferability. We provide point-by-point responses below and indicate where revisions will be made.
- Referee: Section 3: The attack's correctness rests on the proxy metric closely matching the actual weights removed by the victim's chosen pruning rule (Magnitude, Wanda, or SparseGPT). No quantitative overlap statistics, precision-recall figures, or ablation on proxy error (different sparsity targets, calibration data, or implementation details) are reported. Without such measurements, it remains unclear whether the high post-pruning success rates would survive realistic mismatches between the adversary's proxy and the user's pruning procedure.
  Authors: We thank the referee for highlighting this important aspect. Our proxy metric is derived directly from the pruning criteria used by each method (e.g., weight magnitudes for Magnitude pruning, activation-based scores for Wanda, and Hessian approximations for SparseGPT), computed on a calibration dataset. This design ensures a close match by construction. However, we acknowledge that explicit quantitative measures such as overlap statistics and precision-recall curves were not included in the original submission. We will add these analyses, including ablations on different sparsity levels and calibration sets, to the revised manuscript to demonstrate the proxy's fidelity.
  Revision: yes
- Referee: Section 4 / experimental setup: All reported success rates appear to be measured after the authors apply their own proxy-guided modifications followed by the pruning methods. A direct test of the proxy-versus-actual pruning overlap on held-out calibration data or under varied hyper-parameters would be required to establish that the attack transfers to an independent victim implementation.
  Authors: We agree that testing under varied conditions is crucial for demonstrating practicality. In our setup, we followed the standard implementations and default hyperparameters from vLLM for each pruning method. To further validate transferability, we will include additional experiments using held-out calibration data and varied hyperparameters (e.g., different sparsity targets and calibration dataset sizes) in the revision. These will report the overlap between proxy predictions and actual pruned weights, confirming that the attack remains effective even with minor implementation differences.
  Revision: yes
Circularity Check
No significant circularity: empirical attack success measured directly after pruning
The paper constructs an attack by computing a proxy metric from the known pruning rules (Magnitude, Wanda, SparseGPT) to select weights for injection versus repair, then evaluates the resulting malicious behavior after applying the actual pruning methods to the modified models. The reported success rates are direct empirical measurements on five models, not quantities obtained by fitting parameters to match those rates or by reducing equations to self-defined inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked to force the outcomes; the proxy is a first-principles estimate, and the results remain externally falsifiable by reproduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pruning decisions can be approximated by a computable proxy metric derived from the model weights and the pruning algorithm.