Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels
Pith reviewed 2026-05-19 17:50 UTC · model grok-4.3
The pith
Quantization to 3 bits causes 6-21% of previously unbiased BBQ items to develop new stereotypical responses, a shift that perplexity does not flag.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
3-bit quantization causes 6-21% of previously unbiased BBQ items to develop new stereotypical behaviors following a clear dose-response pattern, with models' willingness to select unknown answers declining by 17.4%; these item-level shifts remain invisible to perplexity, which rises less than 0.5% at 8-bit and under 3% at 4-bit across models.
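The headline statistic is an item-level rate, not an aggregate score. A minimal sketch of how such a rate could be computed is shown here, assuming each item's response has already been labelled. The labels (`"unknown"`, `"stereotyped"`, `"anti"`), the item ids, and the definition of "previously unbiased" as "not stereotyped at baseline" are all illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Dict


def emergence_rate(baseline: Dict[str, str], quantized: Dict[str, str]) -> float:
    """Share of items that were not stereotyped at baseline but are
    stereotyped after quantization (assumed definition of 'new bias')."""
    unbiased = [item for item, label in baseline.items() if label != "stereotyped"]
    if not unbiased:
        return 0.0
    flipped = sum(1 for item in unbiased if quantized[item] == "stereotyped")
    return flipped / len(unbiased)


# Toy data: hypothetical item ids and labels, for illustration only.
baseline = {"q1": "unknown", "q2": "unknown", "q3": "stereotyped",
            "q4": "anti", "q5": "unknown"}
quantized = {"q1": "stereotyped", "q2": "unknown", "q3": "stereotyped",
             "q4": "anti", "q5": "unknown"}

rate = emergence_rate(baseline, quantized)
print(rate)  # 0.25: one of the four previously unbiased items flipped
```

The denominator is restricted to items that were unbiased at baseline, so an item that was already stereotyped (here `q3`) does not count toward the rate.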
What carries the argument
Item-level bias tracking on 12,148 BBQ items across five precision levels and three models, analyzed with logistic regression for dose-response.
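A dose-response check of this kind can be sketched as a logistic regression of "item developed new bias" on bit width. The summary does not give the paper's regression specification, so the example below assumes a single numeric predictor (bits), grouped binomial counts, and a hand-written Newton solver. The counts are synthetic and chosen only to illustrate the fit.

```python
import math


def fit_logistic(xs, successes, trials, iters=100, tol=1e-10):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) to grouped binomial data
    using Newton's method on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)
            g0 += s - n * p          # gradient w.r.t. intercept
            g1 += (s - n * p) * x    # gradient w.r.t. slope
            h00 += w                 # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Synthetic counts (NOT from the paper): new-bias items per 1000 unbiased items.
bits = [8, 6, 4, 3]
new_bias = [5, 15, 40, 120]
unbiased = [1000, 1000, 1000, 1000]

b0, b1 = fit_logistic(bits, new_bias, unbiased)
odds_ratio_per_bit_removed = math.exp(-b1)
print(f"slope per bit: {b1:.3f}")
print(f"odds multiplier per bit removed: {odds_ratio_per_bit_removed:.2f}")
```

A negative slope means the odds of new bias rise as bits are removed, which is the pattern a dose-response claim predicts. A real analysis would also report a standard error or confidence interval for the slope.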
If this is right
- Even 4-bit models already show new bias in 2.5-5.6% of items, although perplexity rises by less than 3%.
- Models become less willing to select the "unknown" answer, with selections declining by 17.4% at the lowest precision.
- Aggregate quality metrics systematically miss fairness degradation during compression.
- Deployment of quantized models requires explicit bias testing before release.
Where Pith is reading between the lines
- Compression pipelines for production use may need separate bias audits at each precision step.
- Other compression methods such as pruning could produce similar hidden fairness shifts.
- Post-quantization calibration focused on uncertainty and neutrality might reduce the observed bias increase.
Load-bearing premise
Changes in responses to BBQ items at lower bit widths reflect genuine emergence of stereotypical bias rather than random variation, model degradation, or evaluation noise.
What would settle it
Repeating the controlled runs and finding no statistically significant rise in stereotypical answers at 3-bit or 4-bit precision relative to BF16.
Original abstract
Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models' willingness to select "unknown" answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.
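The stated total of 911,100 inference records follows directly from the design described in the abstract. The short check below multiplies out the four factors; the variable names are illustrative.

```python
# Factors as stated in the abstract.
models = 3             # Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
precision_levels = 5   # BF16 through 3-bit
items = 12_148         # BBQ items
seeds = 5              # random seeds

total_records = models * precision_levels * items * seeds
print(total_records)  # 911100
```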
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that post-training quantization of instruction-tuned LLMs (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at precisions from BF16 to 3-bit induces new stereotypical biases on the BBQ benchmark. Across 12,148 items and 5 seeds (911,100 inferences), 6-21% of previously unbiased items develop stereotypical responses at 3-bit following a logistic-regression-confirmed dose-response, accompanied by a 17.4% drop in 'unknown' selections; these shifts are invisible to perplexity, which rises <0.5% at 8-bit and <3% at 4-bit.
Significance. If the central interpretation holds, the work is significant for demonstrating that standard aggregate metrics miss fairness-critical degradation under compression. Strengths include the multi-model, multi-precision design, large item count with statistical controls, explicit dose-response modeling via logistic regression, and the scale of controlled comparisons that allow distinguishing gradual from threshold effects.
major comments (2)
- [§4.2] §4.2 (item-level shift analysis): The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.
- [Methods §2.1] Methods §2.1 and Appendix A: The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.
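The disaggregation requested in the first major comment can be sketched as a simple test. Under the referee's null hypothesis of a non-preferential default, items that leave "unknown" should split evenly between the stereotyped and anti-stereotyped options. The example uses an exact two-sided binomial test written with the standard library. The counts are hypothetical, and this is one possible control rather than the analysis the authors will run.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical counts: items answered "unknown" at baseline that switched
# to a named option after quantization.
to_stereotyped = 130
to_anti_stereotyped = 70
switched = to_stereotyped + to_anti_stereotyped

p_value = binom_two_sided_p(to_stereotyped, switched)
print(f"share stereotyped: {to_stereotyped / switched:.2f}, p = {p_value:.2g}")
```

A small p-value with a stereotyped share above 0.5 would point to preferential bias emergence. An even split would be consistent with general abstention failure.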
minor comments (3)
- [Abstract] Abstract: The 6-21% range should be disaggregated by model or reported with per-model confidence intervals to allow readers to assess consistency.
- [Figure 2] Figure 2 (dose-response plots): Add error bars or shaded regions reflecting the 5 random seeds so that the logistic regression fit can be visually assessed for robustness.
- [Table 1] Table 1: Clarify whether the 'previously unbiased' baseline is computed per model or pooled across models, as this affects the denominator for the 6-21% statistic.
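The error bars and confidence intervals requested in the minor comments could be derived from the per-seed rates. The sketch below assumes one emergence rate per seed for a given model and precision level and uses a t-based 95% interval with 4 degrees of freedom. The five rates are made up for illustration.

```python
import math
import statistics

# Hypothetical per-seed emergence rates for one model at one precision level.
seed_rates = [0.118, 0.124, 0.121, 0.115, 0.122]

mean_rate = statistics.mean(seed_rates)
sd_rate = statistics.stdev(seed_rates)   # sample standard deviation (n - 1)

T_CRIT_DF4 = 2.776                       # two-sided 95% t critical value, df = 4
half_width = T_CRIT_DF4 * sd_rate / math.sqrt(len(seed_rates))
ci_low, ci_high = mean_rate - half_width, mean_rate + half_width

print(f"mean = {mean_rate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only five seeds the interval is wide relative to a normal approximation, which is why the t critical value is used here.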
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments raise important points about distinguishing bias emergence from general degradation and ensuring reproducibility. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [§4.2] §4.2 (item-level shift analysis): The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.
  Authors: We agree that a disaggregation of anti-stereotypical responses is needed to strengthen the attribution to specific bias rather than uniform degradation in instruction following or abstention. Our logistic regression analysis models the dose-response specifically for stereotypical outputs, but we acknowledge the current presentation does not explicitly compare rates of increase across categories. In the revised manuscript we will add this breakdown to §4.2, reporting the relative changes in stereotypical, anti-stereotypical, and unknown selections across precision levels together with statistical tests for differential effects. This addition directly addresses the concern and supports the fairness-critical interpretation.
  Revision: yes
- Referee: [Methods §2.1] Methods §2.1 and Appendix A: The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.
  Authors: The referee is correct that the current description is insufficient for full reproducibility. In the revised manuscript we will expand §2.1 and Appendix A to specify the quantization library and version, the exact calibration dataset, the group size parameter, and an explicit statement that no post-quantization fine-tuning was applied. These details will allow independent replication of the observed threshold effects at 4-bit and 3-bit.
  Revision: yes
Circularity Check
No circularity: direct empirical measurements on external benchmark
This paper reports controlled experiments evaluating three instruction-tuned LLMs at five precision levels on 12,148 BBQ items, directly counting item-level shifts in stereotypical responses and 'unknown' selections across random seeds. No derivations, first-principles results, equations, or predictions are claimed. Logistic regression serves only to statistically confirm the observed dose-response in the measured data rather than generating or substituting for primary results. No self-citations are load-bearing, no parameters are fitted to a subset and then presented as predictions, and all outcomes derive from comparisons against the external BBQ benchmark. The study is self-contained with independent empirical content and no reduction of claims to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard assumptions of logistic regression and averaging across random seeds apply to the bias measurements and dose-response analysis.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat.induction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: models' willingness to select 'unknown' answers declines by 17.4%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.