pith. sign in

arxiv: 2510.07985 · v3 · submitted 2025-10-09 · 💻 cs.LG · cs.AI· cs.CR

Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Pith reviewed 2026-05-18 08:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR
keywords LLM pruningadversarial attackmodel compressionjailbreakinference securityweight sparsificationdeployment vulnerability
0
0 comments X

The pith

Attackers can craft LLMs that stay benign until pruned, then turn malicious.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that an adversary can build a large language model which performs normally and passes inspection before any compression steps. It relies on a proxy calculation to predict which weights a chosen pruning method will delete. Malicious behaviors are embedded in weights expected to survive pruning while removable weights are adjusted to cancel those behaviors in the full model. If the approach works, then downloading and pruning models from untrusted sources can silently introduce jailbreaks, refusals of normal instructions, or targeted content injection. This matters because pruning has become a standard way to reduce memory use in inference engines, creating a new point where security can be compromised without the user's knowledge.

Core claim

Modern LLM pruning methods can be maliciously exploited by constructing a model that appears benign yet, once pruned, exhibits malicious behaviors. The adversary computes a proxy metric estimating how likely each parameter is to be pruned, injects malicious behavior into parameters unlikely to be pruned, and repairs the model using parameters likely to be pruned to cancel out the injected behavior in the unpruned model. Extensive evaluation on five models shows that after pruning with Magnitude, Wanda, or SparseGPT in vLLM, the resulting models display strong malicious behaviors across attack scenarios, with success rates reaching 95.7% for jailbreak, 98.7% for benign instruction refusal, 99

What carries the argument

A proxy metric that estimates how likely each parameter is to be pruned by the victim's chosen method.

If this is right

  • Pruning in inference engines no longer preserves safety properties when the model source is untrusted.
  • Safety alignments can be bypassed at deployment time without changing the original model visibly.
  • Users who prune downloaded models face risks of jailbreaks and content injection after compression.
  • Verification of model behavior must extend to pruned versions rather than stopping at the full model.
  • Model compression pipelines require new checks to detect adversarial preparation before pruning occurs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-based hiding technique could be tested against other compression approaches such as quantization.
  • Inference engines might add post-pruning behavioral tests on a small set of safety prompts to catch activated malice.
  • Model repositories could require provenance checks or signed pruning configurations to reduce supply-chain exposure.
  • This attack surface suggests that future pruning research should include adversarial robustness as a standard evaluation criterion.

Load-bearing premise

The proxy metric for guessing which weights will be pruned must be accurate enough that the malicious injection stays hidden and the unpruned model shows no obvious problems.

What would settle it

A case where the proxy prediction deviates enough from actual pruning that the hidden malicious behaviors fail to appear after compression or become detectable in the original model.

read the original abstract

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning in vLLM are applied (Magnitude, Wanda, and SparseGPT), it consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that an adversary can modify an LLM so that it appears benign in its unpruned state but reliably exhibits malicious behaviors (jailbreaks, benign-instruction refusal, targeted content injection) once standard pruning methods (Magnitude, Wanda, SparseGPT) from vLLM are applied. The attack uses a proxy metric to predict per-weight pruning likelihood, injects the malicious payload into weights unlikely to be pruned, and applies a canceling repair in weights likely to be pruned. Experiments on five models report attack success rates up to 95.7% (jailbreak), 98.7% (refusal), and 99.5% (content injection) after pruning.

Significance. If the proxy accurately anticipates the victim's pruning decisions, the result demonstrates a practical, deployment-time security vulnerability in widely used model-compression pipelines. The concrete success rates across multiple models and pruning methods constitute falsifiable empirical evidence that can be independently verified. The work therefore supplies a clear, testable warning about the security of pruned LLMs that practitioners should address.

major comments (2)
  1. Section 3: The attack's correctness rests on the proxy metric closely matching the actual weights removed by the victim's chosen pruning rule (Magnitude, Wanda, or SparseGPT). No quantitative overlap statistics, precision-recall figures, or ablation on proxy error (different sparsity targets, calibration data, or implementation details) are reported. Without such measurements, it remains unclear whether the high post-pruning success rates would survive realistic mismatches between the adversary's proxy and the user's pruning procedure.
  2. Section 4 / experimental setup: All reported success rates appear to be measured after the authors apply their own proxy-guided modifications followed by the pruning methods. A direct test of the proxy-versus-actual pruning overlap on held-out calibration data or under varied hyper-parameters would be required to establish that the attack transfers to an independent victim implementation.
minor comments (2)
  1. The abstract and introduction should explicitly list the five evaluated models (names and sizes) rather than deferring the information.
  2. Figure captions and axis labels for the success-rate plots should state the exact sparsity ratio and calibration dataset used for each pruning method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's insightful comments on our manuscript. We appreciate the emphasis on the importance of validating the proxy metric's accuracy and transferability. We provide point-by-point responses below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: Section 3: The attack's correctness rests on the proxy metric closely matching the actual weights removed by the victim's chosen pruning rule (Magnitude, Wanda, or SparseGPT). No quantitative overlap statistics, precision-recall figures, or ablation on proxy error (different sparsity targets, calibration data, or implementation details) are reported. Without such measurements, it remains unclear whether the high post-pruning success rates would survive realistic mismatches between the adversary's proxy and the user's pruning procedure.

    Authors: We thank the referee for highlighting this important aspect. Our proxy metric is derived directly from the pruning criteria used by each method (e.g., weight magnitudes for Magnitude pruning, activation-based scores for Wanda, and Hessian approximations for SparseGPT), computed on a calibration dataset. This design ensures a close match by construction. However, we acknowledge that explicit quantitative measures such as overlap statistics and precision-recall curves were not included in the original submission. We will add these analyses, including ablations on different sparsity levels and calibration sets, to the revised manuscript to demonstrate the proxy's fidelity. revision: yes

  2. Referee: Section 4 / experimental setup: All reported success rates appear to be measured after the authors apply their own proxy-guided modifications followed by the pruning methods. A direct test of the proxy-versus-actual pruning overlap on held-out calibration data or under varied hyper-parameters would be required to establish that the attack transfers to an independent victim implementation.

    Authors: We agree that testing under varied conditions is crucial for demonstrating practicality. In our setup, we followed the standard implementations and default hyperparameters from vLLM for each pruning method. To further validate transferability, we will include additional experiments using held-out calibration data and varied hyperparameters (e.g., different sparsity targets and calibration dataset sizes) in the revision. These will report the overlap between proxy predictions and actual pruned weights, confirming that the attack remains effective even with minor implementation differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical attack success measured directly after pruning

full rationale

The paper constructs an attack by computing a proxy metric from the known pruning rules (Magnitude, Wanda, SparseGPT) to select weights for injection versus repair, then evaluates the resulting malicious behavior after applying the actual pruning methods to the modified models. The reported success rates are direct empirical measurements on five models rather than quantities derived by fitting parameters to match those rates or by reducing equations to self-defined inputs. No load-bearing self-citations, ansatzes, or uniqueness theorems are invoked to force the outcomes; the proxy is first-principles estimation and the results remain externally falsifiable by reproduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The attack rests on the existence of a reliable proxy for pruning decisions and the ability to solve for a repair that cancels the payload in the dense model. No new physical or mathematical entities are introduced.

axioms (1)
  • domain assumption Pruning decisions can be approximated by a computable proxy metric derived from the model weights and the pruning algorithm.
    Central to choosing which weights receive the payload versus the repair.

pith-pipeline@v0.9.0 · 5828 in / 1279 out tokens · 34261 ms · 2026-05-18T08:48:13.858920+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.