pith. machine review for the scientific record.

arxiv: 2605.08137 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI · cs.CY

Recognition: no theorem link

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

Plawan Kumar Rath, Rahul Maliakkal

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CY
keywords LLM pruning · bias amplification · model compression · edge AI · stereotypical bias · perplexity evaluation · BBQ benchmark · fairness in compression

The pith

Activation-aware pruning preserves perplexity but produces the largest bias increases in compressed LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled study of three instruction-tuned models pruned by random, magnitude, and activation-aware methods at multiple sparsity levels, measuring both language modeling quality and bias on a large set of BBQ benchmark items. It finds that the pruning technique best at keeping perplexity low also drives the biggest rises in stereotypical responses and new biased behaviors. This matters for edge AI because deployments often select compression methods based solely on performance metrics that fail to detect fairness degradation. The study further reports that unstructured pruning yields no storage or speed improvements on real hardware, removing the main reason for using it in IoT settings. These results indicate that standard validation pipelines can miss substantial shifts in model alignment.

Core claim

The paper establishes a "Smart Pruning Paradox": activation-aware pruning (Wanda) maintains near-original perplexity (only a 3.5 percent increase at 50 percent sparsity) yet produces the highest bias amplification, with the Stereotype Reliance Score rising 83.7 percent and 47-59 percent of previously unbiased items developing new stereotypical behaviors at 70 percent sparsity. Random pruning, by contrast, destroys language capability but keeps bias near chance levels.

What carries the argument

The empirical comparison of Random, Magnitude, and Wanda pruning across sparsity levels, tracked via perplexity, Stereotype Reliance Score, and rates of bias-state transitions on the BBQ benchmark.
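
The difference between the two informed criteria can be sketched in a few lines. This is an illustrative simplification, not the paper's code: real Wanda (Sun et al. [6]) scores each weight by its magnitude times the L2 norm of its input activation and ranks weights per output row; the toy version below ranks globally.

```python
import numpy as np

# Toy sketch of the two data-dependent pruning criteria compared in the study.
# Magnitude pruning scores each weight by |W_ij|; Wanda scores it by
# |W_ij| * ||X_j||_2, where ||X_j||_2 is the L2 norm of the j-th input
# activation over calibration data. (Simplified: global ranking, toy sizes.)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))            # toy weight matrix (out_dim x in_dim)
X = rng.normal(size=(128, 6))          # toy calibration activations (tokens x in_dim)

def prune_mask(scores, sparsity):
    """Keep the top-(1 - sparsity) fraction of weights by score."""
    k = int(sparsity * scores.size)
    threshold = np.sort(scores.ravel())[k]
    return scores >= threshold          # True = weight survives

mag_mask = prune_mask(np.abs(W), 0.5)
wanda_mask = prune_mask(np.abs(W) * np.linalg.norm(X, axis=0), 0.5)

# Both masks zero half the weights, but generally not the same half:
# Wanda protects small weights that feed high-magnitude activations.
print(int(mag_mask.sum()), int(wanda_mask.sum()))  # 12 12
```

The study's finding is precisely that the activation-aware choice, while best for perplexity, is the one that amplifies bias most.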

If this is right

  • Perplexity-based evaluation gives false assurance of behavioral equivalence after pruning.
  • Pruning produces bias transition rates nearly three times higher than those reported for quantization.
  • Unstructured pruning supplies zero storage savings and zero inference latency reduction on actual edge hardware.
  • IoT deployment pipelines must incorporate bias-aware validation before releasing pruned models.
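
The zero-savings point is easy to verify in a hedged sketch (illustrative, not the paper's hardware benchmark): zeroing weights in place leaves a dense float32 tensor exactly the same size, and a dense kernel still multiplies through the zeros.

```python
import numpy as np

# Illustrative check: unstructured pruning stored in a dense format saves
# nothing. The zeroed weights still occupy their float32 slots, so storage
# is unchanged, and a dense matmul still performs the same operations.
W = np.random.default_rng(1).normal(size=(1024, 1024)).astype(np.float32)
dense_bytes = W.nbytes                          # 1024 * 1024 * 4 = 4194304

mask = np.abs(W) < np.quantile(np.abs(W), 0.5)  # select the smallest 50%
W[mask] = 0.0                                   # "prune" them in place

assert W.nbytes == dense_bytes                  # identical storage footprint
print(W.nbytes)  # 4194304 bytes either way
```

Realizing any savings would require a sparse storage format and hardware sparse kernels, which the tested edge devices lack.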

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The findings imply that capability-preserving compression methods may systematically interact with alignment in ways that performance-only checks cannot detect.
  • Future experiments could test whether structured pruning methods avoid the same bias amplification while still delivering hardware benefits.
  • The high rate of new stereotypical behaviors suggests that pruning may be a stronger disruptor of prior safety training than other compression techniques.
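
The 47-59 percent figure quoted throughout is a bias-state transition rate. A hedged sketch of how such a rate is computed, using hypothetical per-item labels rather than the paper's data:

```python
# Illustrative computation of the "new stereotypical behaviors" rate:
# among items the dense model answered without bias, what fraction does
# the pruned model answer stereotypically? Labels below are made up.
dense_labels  = ["unbiased", "unbiased", "stereo", "unbiased", "unbiased"]
pruned_labels = ["stereo",   "unbiased", "stereo", "stereo",   "unbiased"]

was_unbiased = [d == "unbiased" for d in dense_labels]
newly_stereo = [
    d == "unbiased" and p == "stereo"
    for d, p in zip(dense_labels, pruned_labels)
]

rate = sum(newly_stereo) / sum(was_unbiased)
print(rate)  # 2 of 4 previously unbiased items flipped -> 0.5
```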

Load-bearing premise

The three tested models and the BBQ benchmark items are representative enough for the observed bias amplification to appear in other models, tasks, and real-world deployments.

What would settle it

A replication on a wider range of models or a different bias benchmark that finds no consistent rise in Stereotype Reliance Score or new stereotypical behaviors after activation-aware pruning would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.08137 by Plawan Kumar Rath, Rahul Maliakkal.

Figure 1. SRS vs. sparsity level for each model, with lines colored by pruning method.
Figure 2. Evaluation gap: SRS percentage change (blue) vs. perplexity percentage change.
Figure 5. Grouped bar chart showing SRS at 50% sparsity by model and pruning method.
Figure 4. USR vs. sparsity level for each model, with lines colored by pruning method.
Figure 6. SRS by bias category for each model, faceted by pruning method.
Figure 7. Comparison of all-items vs. latent-bias-filtered items SRS trajectories.
read the original abstract

Weight pruning is widely advocated for deploying Large Language Models on resource-constrained IoT and edge devices, yet its impact on model fairness remains poorly understood. We conduct a controlled empirical study of three instruction-tuned models (Gemma-2-9b-it, Mistral-7B-Instruct-v0.3, Phi-3.5-mini-instruct) across three pruning methods (Random, Magnitude, Wanda) at four sparsity levels (10-70%) on 12,148 BBQ bias benchmark items with 5 random seeds, totaling 2,368,860 inference records. Our results reveal a Smart Pruning Paradox: activation-aware pruning (Wanda) preserves perplexity nearly perfectly (just 3.5% increase at 50% sparsity for Mistral-7B), yet produces the highest bias amplification, with Stereotype Reliance Score increasing 83.7% and 47-59% of previously unbiased items developing new stereotypical behaviors at 70% sparsity. Random pruning destroys language capability entirely (perplexity exceeding $10^4$ and reaching $10^8$) but produces only random-chance bias. We further show that unstructured pruning provides zero storage savings and zero inference latency reduction on real edge hardware, undermining the primary motivation for its use in IoT deployment. Of 180 dense-vs-pruned comparisons, 141 (78.3%) are significant ($p < 0.05$) with mean $|h| = 0.305$. Published quantization studies report up to 21% of responses flipping between biased and unbiased states; our pruning results show transition rates nearly three times higher (47-59%), suggesting pruning poses a categorically greater risk to alignment than quantization. These findings demonstrate that perplexity-based evaluation provides false assurance of behavioral equivalence, and that IoT deployment pipelines require bias-aware validation before deploying pruned models at the edge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a controlled empirical study examining the effects of weight pruning on bias in large language models intended for edge AI applications. It evaluates three instruction-tuned models—Gemma-2-9b-it, Mistral-7B-Instruct-v0.3, and Phi-3.5-mini-instruct—using three pruning techniques (Random, Magnitude, and Wanda) at sparsity levels from 10% to 70%. The study utilizes the BBQ bias benchmark with 12,148 items, five random seeds, and reports on 2,368,860 inference records. Key findings include the 'Smart Pruning Paradox' where activation-aware pruning (Wanda) maintains near-original perplexity (e.g., 3.5% increase at 50% sparsity for Mistral-7B) but causes the largest bias amplification, including an 83.7% increase in Stereotype Reliance Score and 47-59% of items developing new stereotypical behaviors at 70% sparsity. In contrast, random pruning severely degrades perplexity but results in bias levels consistent with chance. The paper also claims that unstructured pruning yields no practical storage or latency benefits on edge hardware and that bias transition rates are nearly three times higher than those reported in quantization studies.

Significance. If the observed patterns hold, this work has significant implications for the deployment of compressed LLMs in fairness-sensitive applications on resource-limited devices. The large-scale experimental design, involving multiple models, methods, sparsity levels, and seeds, with 78.3% of 180 comparisons reaching statistical significance (mean |h| = 0.305), provides robust evidence that standard perplexity metrics can mask substantial changes in model behavior, particularly regarding bias. The explicit comparison to quantization and the hardware evaluation add practical relevance. This contributes empirical data to the discussion on model compression trade-offs, highlighting the need for bias-aware evaluation in pruning pipelines.

major comments (2)
  1. [Results (hardware evaluation)] Results section on hardware evaluation: The claim that unstructured pruning provides zero storage savings and zero inference latency reduction on real edge hardware is central to critiquing the motivation for pruning in IoT settings. However, the specific hardware platform, measurement tools (e.g., profiling libraries), and exact comparison metrics to dense baselines are not detailed sufficiently to allow independent verification or assessment of generalizability across edge devices.
  2. [Discussion] Discussion: The assertion that pruning poses a 'categorically greater risk to alignment than quantization' is supported by the higher transition rates (47-59% vs. up to 21% in published studies), but this relies on cross-study comparison without controlling for model, benchmark, or task differences. A direct head-to-head experiment on the same models and BBQ items would be needed to substantiate the categorical framing.
minor comments (3)
  1. [Abstract] Abstract: The Stereotype Reliance Score is referenced without a concise definition or pointer to its computation formula, which may hinder quick comprehension for readers unfamiliar with the metric.
  2. [Experimental setup] Experimental setup: While five random seeds are used, the manuscript should explicitly state how seed variability is aggregated in the reported means and significance tests for all bias metrics.
  3. [Figures] Figures: Bias transition plots would benefit from including per-seed variability (e.g., error bars) to visually convey the robustness of the 47-59% new stereotypical behavior rates.
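
One plausible aggregation of the kind the referee asks for, sketched with made-up numbers (the manuscript does not specify its procedure): treat each seed's metric value as one observation, then report the mean with a spread measure suitable for error bars.

```python
import statistics

# Hypothetical per-seed Stereotype Reliance Scores for a single
# model x pruning-method x sparsity cell; the real values are not given here.
srs_by_seed = [0.41, 0.44, 0.39, 0.43, 0.42]

mean_srs = statistics.mean(srs_by_seed)   # reported point estimate
sd_srs = statistics.stdev(srs_by_seed)    # sample std dev across seeds (n-1)
sem = sd_srs / len(srs_by_seed) ** 0.5    # standard error, for error bars

print(round(mean_srs, 3), round(sem, 4))  # 0.418 0.0086
```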

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment point by point below, with planned changes to the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Results (hardware evaluation)] Results section on hardware evaluation: The claim that unstructured pruning provides zero storage savings and zero inference latency reduction on real edge hardware is central to critiquing the motivation for pruning in IoT settings. However, the specific hardware platform, measurement tools (e.g., profiling libraries), and exact comparison metrics to dense baselines are not detailed sufficiently to allow independent verification or assessment of generalizability across edge devices.

    Authors: We agree that the hardware evaluation section requires greater specificity for reproducibility. In the revised manuscript we will expand this subsection to explicitly state the edge hardware platform used, the measurement tools and libraries employed for profiling storage and latency, and the exact comparison metrics (model size in bytes and end-to-end inference latency in milliseconds) against dense baselines. These additions will substantiate why unstructured pruning yields no practical benefits on the tested devices, which lack native sparse acceleration. revision: yes

  2. Referee: [Discussion] Discussion: The assertion that pruning poses a 'categorically greater risk to alignment than quantization' is supported by the higher transition rates (47-59% vs. up to 21% in published studies), but this relies on cross-study comparison without controlling for model, benchmark, or task differences. A direct head-to-head experiment on the same models and BBQ items would be needed to substantiate the categorical framing.

    Authors: We acknowledge the inherent limitations of cross-study comparisons. While a controlled head-to-head experiment on identical models and items would be ideal, it lies outside the scope of the present study. We will revise the Discussion to remove the 'categorically greater risk' phrasing, instead reporting that bias transition rates under pruning are substantially higher than those in the cited quantization literature while explicitly noting the uncontrolled differences in models, benchmarks, and tasks. This preserves the empirical observation without overstating the comparative claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity: pure empirical measurement study

full rationale

This is a controlled empirical study that runs inference on three fixed models with three pruning methods at multiple sparsity levels, computes perplexity and Stereotype Reliance Scores on the BBQ benchmark items, and reports statistical comparisons. No equations, parameters, or derivations are present that could reduce any reported result to a fitted input or self-citation. All claims rest on new experimental data collection (2.3M+ records) and direct measurement rather than any self-referential construction. Self-citations, if any, are not load-bearing for the central observations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the BBQ benchmark as a bias measure and on standard statistical testing; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The BBQ benchmark provides a reliable proxy for stereotypical bias in model outputs
    All bias amplification claims are measured against this benchmark
  • standard math Statistical significance at p<0.05 with the reported effect size is sufficient to establish the observed differences
    Used to support 141 of 180 comparisons
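
The "mean |h| = 0.305" in the second axiom refers to Cohen's h (Cohen [28]), the effect size for a difference between two proportions via the arcsine transform. A minimal sketch with hypothetical stereotype-response rates:

```python
import math

# Cohen's h: effect size for the difference between two proportions,
# using the arcsine (variance-stabilizing) transform (Cohen, 1988).
def cohens_h(p1, p2):
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical rates of stereotypical responses, pruned vs. dense:
h = cohens_h(0.48, 0.33)
print(round(abs(h), 3))  # 0.307, in the neighborhood of the reported mean
```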

pith-pipeline@v0.9.0 · 5655 in / 1312 out tokens · 57547 ms · 2026-05-12T02:43:52.164182+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

  1. [1]

    Large Language Model Deployment on Resource-Constrained Edge Devices: A Practitioner’s Survey,

    R. Maliakkal, Y. Makin, P. Rath, R. Jain, and A. Sadhoo, “Large Language Model Deployment on Resource-Constrained Edge Devices: A Practitioner’s Survey,” in Proc. IEEE 16th Annu. Computing and Communication Workshop and Conf. (CCWC), 2026

  2. [2]

    Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency,

    B. Aregawi, X. Zhang, et al., “Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency,” ACM Transactions on Internet of Things, 2025

  3. [3]

    Large Language Models: A Survey

    S. Minaee, T. Mikolov, et al., “Large Language Models: A Survey,” arXiv preprint arXiv:2402.06196, 2024

  4. [4]

    A Survey of Model Compression Techniques: Past, Present, and Future,

    Z. Liao et al., “A Survey of Model Compression Techniques: Past, Present, and Future,” Frontiers in Robotics and AI, vol. 12, 2025

  5. [5]

    A Survey on Model Compression for Large Language Models,

    X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A Survey on Model Compression for Large Language Models,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 1556–1577, 2024

  6. [6]

    A Simple and Effective Pruning Approach for Large Language Models,

    M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A Simple and Effective Pruning Approach for Large Language Models,” in Proc. ICLR, 2024

  7. [7]

    SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,

    E. Frantar and D. Alistarh, “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,” in Proc. ICML, 2023

  8. [8]

    What Do Compressed Deep Neural Networks Forget?

    S. Hooker, A. Courville, G. Clark, Y. Dauphin, and A. Frome, “What Do Compressed Deep Neural Networks Forget?” arXiv preprint arXiv:1911.05248, 2019

  9. [9]

    Characterising Bias in Compressed Models,

    S. Hooker, N. Moorosi, G. Clark, S. Bengio, and E. Denton, “Characterising Bias in Compressed Models,” arXiv preprint arXiv:2010.03058, 2020

  10. [10]

    Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression,

    Z. Xu, A. Gupta, T. Li, O. Bentham, and V. Srikumar, “Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression,” in Findings of EMNLP, 2024

  11. [11]

    Learning Both Weights and Connections for Efficient Neural Networks,

    S. Han, J. Pool, J. Tran, and W. Dally, “Learning Both Weights and Connections for Efficient Neural Networks,” in Proc. NeurIPS, pp. 1135–1143, 2015

  12. [12]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,

    S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” in Proc. ICLR, 2016

  13. [13]

    A Fast Post-Training Pruning Framework for Transformers,

    W. Kwon, S. Kim, M. W. Mahoney, J. Hassoun, K. Keutzer, and A. Gholami, “A Fast Post-Training Pruning Framework for Transformers,” in Proc. NeurIPS, 2022

  14. [14]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks,

    J. Frankle and M. Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks,” in Proc. ICLR, 2019

  15. [15]

    Bias and Fairness in Large Language Models: A Survey,

    I. O. Gallegos, R. A. Rossi, et al., “Bias and Fairness in Large Language Models: A Survey,” Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024

  16. [16]

    BBQ: A Hand-Built Bias Benchmark for Question Answering,

    A. Parrish, A. Chen, N. Nangia, et al., “BBQ: A Hand-Built Bias Benchmark for Question Answering,” in Findings of ACL, pp. 2086–2105, 2022

  17. [17]

    Pruning Has a Disparate Impact on Model Accuracy,

    C. Tran, F. Fioretto, J.-E. Kim, and R. Naidu, “Pruning Has a Disparate Impact on Model Accuracy,” in Proc. NeurIPS, 2022

  18. [18]

    Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures,

    E. Iofinova, A. Peste, and D. Alistarh, “Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures,” in Proc. CVPR, pp. 24364–24373, 2023

  19. [19]

    The Other Side of Compression: Measuring Bias in Pruned Transformers,

    I. Proskurina, G. Metzler, and J. Velcin, “The Other Side of Compression: Measuring Bias in Pruned Transformers,” in Proc. IDA, 2023

  20. [20]

    Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,

    J. Hong, J. Duan, C. Zhang, et al., “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression,” in Proc. ICML, 2024

  21. [21]

    Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications,

    B. Wei, K. Huang, Y. Huang, et al., “Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications,” in Proc. ICML, pp. 52588–52610, 2024

  22. [22]

    A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models,

    K. Ramesh, A. Chavan, S. Pandit, and S. Sitaram, “A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models,” in Proc. ACL, pp. 15762–15782, 2023

  23. [23]

    Accuracy is Not All You Need,

    S. Dutta, A. Pandey, S. Chattopadhyay, T. Sinha, and S. Chakraborty, “Accuracy is Not All You Need,” arXiv preprint arXiv:2407.09141, 2024

  24. [24]

    Uncertainty Drives Social Bias Changes in Quantized Large Language Models,

    S. Z. Hua, S. Lotfi, and I. Y. Chen, “Uncertainty Drives Social Bias Changes in Quantized Large Language Models,” arXiv preprint arXiv:2602.06181, 2026

  25. [25]

    Efficient Large Language Models: A Survey,

    Z. Wan, X. Wang, C. Liu, et al., “Efficient Large Language Models: A Survey,” Transactions on Machine Learning Research, 2024

  26. [26]

    Wanda++: Pruning Large Language Models via Regional Gradients,

    Y. Yang, K. Zhen, B. Ganesh, A. Galstyan, et al., “Wanda++: Pruning Large Language Models via Regional Gradients,” in Findings of ACL, pp. 4321–4333, 2025

  27. [27]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,

    C. Raffel, N. Shazeer, A. Roberts, et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” JMLR, vol. 21, pp. 1–67, 2020

  28. [28]

    Statistical Power Analysis for the Behavioral Sciences, 2nd ed.

    J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Lawrence Erlbaum Associates, 1988

  29. [29]

    MLX: An Array Framework for Apple Silicon,

    Apple Inc., “MLX: An Array Framework for Apple Silicon,” GitHub, 2023

  30. [30]

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,

    B. Wang, W. Chen, H. Pei, et al., “DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models,” in Proc. NeurIPS, 2023

  31. [31]

    Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models,

    V. Kharinaev et al., “Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models,” arXiv preprint arXiv:2502.15799, 2025

  32. [32]

    Less Is More? Examining Fairness in Pruned Large Language Models for Summarizing Opinions,

    P. Huang et al., “Less Is More? Examining Fairness in Pruned Large Language Models for Summarizing Opinions,” 2024