
arxiv: 2605.15208 · v1 · pith:ZT36ZURYnew · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Pith reviewed 2026-05-19 17:50 UTC · model grok-4.3

keywords: quantization · bias emergence · large language models · model compression · fairness evaluation · BBQ benchmark · precision levels

The pith

3-bit quantization causes 6-21% of previously unbiased BBQ items to develop new stereotypical behaviors, while perplexity rises by under 3% even at 4-bit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether post-training quantization preserves model behavior beyond standard quality measures. It tracks responses item by item on the BBQ benchmark as precision drops from BF16 to 3 bits in three instruction-tuned models. Results show a steady rise in new stereotypical answers that follows a dose-response curve, confirmed statistically, even though aggregate metrics like perplexity remain almost flat. The work demonstrates that fairness failures can appear at precision levels still considered safe by current evaluation standards. This gap matters because quantized models are widely deployed on edge devices where bias can affect real decisions.

Core claim

3-bit quantization causes 6-21% of previously unbiased BBQ items to develop new stereotypical behaviors following a clear dose-response pattern, with models' willingness to select unknown answers declining by 17.4%; these item-level shifts remain invisible to perplexity, which rises less than 0.5% at 8-bit and under 3% at 4-bit across models.

What carries the argument

Item-level bias tracking on 12,148 BBQ items across five precision levels and three models, analyzed with logistic regression for dose-response.
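The item-level tracking described here can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the record format, the answer labels (`"stereotype"`, `"anti"`, `"unknown"`), and the `threshold` rule for calling an item newly biased are all assumptions made for the example.

```python
from collections import defaultdict

def stereo_rate(records):
    """Map item_id -> fraction of runs (seeds) with a stereotypical answer.

    `records` is an iterable of (item_id, answer) pairs; the label
    "stereotype" is a hypothetical encoding chosen for this sketch.
    """
    hits, total = defaultdict(int), defaultdict(int)
    for item_id, answer in records:
        total[item_id] += 1
        hits[item_id] += answer == "stereotype"
    return {i: hits[i] / total[i] for i in total}

def emergence_rate(baseline_records, quantized_records, threshold=0.5):
    """Share of baseline-unbiased items that become stereotypical.

    Assumption: an item is "unbiased" if it is never stereotypical at
    baseline, and "newly biased" if its stereotypical rate after
    quantization reaches `threshold`.
    """
    base = stereo_rate(baseline_records)
    quant = stereo_rate(quantized_records)
    unbiased = [i for i, r in base.items() if r == 0.0 and i in quant]
    if not unbiased:
        return 0.0
    emerged = [i for i in unbiased if quant[i] >= threshold]
    return len(emerged) / len(unbiased)

# Toy data: two seeds per item, three items.
bf16 = [("a", "unknown"), ("a", "unknown"),
        ("b", "unknown"), ("b", "unknown"),
        ("c", "stereotype"), ("c", "unknown")]
q3 = [("a", "stereotype"), ("a", "stereotype"),
      ("b", "unknown"), ("b", "unknown"),
      ("c", "stereotype"), ("c", "stereotype")]
print(emergence_rate(bf16, q3))
```

Conditioning on items that were unbiased at baseline is what separates "new" bias from bias the model already had, which an aggregate rate would blend together.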

If this is right

  • Even 4-bit models already show new bias in 2.5-5.6% of items despite minimal perplexity shift.
  • Models become less willing to select the "unknown" answer, with the rate declining by 17.4% at the lowest precision.
  • Aggregate quality metrics systematically miss fairness degradation during compression.
  • Deployment of quantized models requires explicit bias testing before release.
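As a rough illustration of the "unknown" selection figure in the list above, the sketch below computes a selection rate and its change under compression. The summary does not state whether 17.4% is a relative or an absolute (percentage-point) decline, so both are shown; the answer labels and the toy data are assumptions.

```python
def unknown_selection_rate(answers):
    """Fraction of answers that pick the "unknown" option."""
    return sum(a == "unknown" for a in answers) / len(answers)

def relative_change(baseline, compressed):
    """Relative change of a rate with respect to its baseline value."""
    return (compressed - baseline) / baseline

# Toy data only; these are not the paper's measurements.
bf16_answers = ["unknown"] * 8 + ["stereotype"] * 2
q3_answers = ["unknown"] * 6 + ["stereotype"] * 3 + ["anti"]

usr_bf16 = unknown_selection_rate(bf16_answers)
usr_q3 = unknown_selection_rate(q3_answers)
print("absolute change:", usr_q3 - usr_bf16)
print("relative change:", relative_change(usr_bf16, usr_q3))
```

Reporting both numbers side by side avoids the ambiguity, since the same drop looks larger in relative terms whenever the baseline rate is below 1.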

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Compression pipelines for production use may need separate bias audits at each precision step.
  • Other compression methods such as pruning could produce similar hidden fairness shifts.
  • Post-quantization calibration focused on uncertainty and neutrality might reduce the observed bias increase.

Load-bearing premise

Changes in responses to BBQ items at lower bit widths reflect genuine emergence of stereotypical bias rather than random variation, model degradation, or evaluation noise.

What would settle it

Repeating the controlled runs and finding no statistically significant rise in stereotypical answers at 3-bit or 4-bit precision relative to BF16.
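One way such a replication could be scored is a paired test on the same items at two precision levels. The sketch below uses an exact McNemar test; this is a test chosen for illustration, not one the summary attributes to the paper, and it assumes each item has already been reduced to a single stereotypical / not stereotypical label per precision level (for example by majority vote across seeds).

```python
from math import comb

def discordant_counts(baseline, quantized):
    """Count items whose label flips between the two precision levels.

    Both arguments map item_id -> bool (True = stereotypical answer).
    Returns (gained, lost): items that became stereotypical, and items
    that stopped being stereotypical.
    """
    shared = [i for i in baseline if i in quantized]
    gained = sum(1 for i in shared if not baseline[i] and quantized[i])
    lost = sum(1 for i in shared if baseline[i] and not quantized[i])
    return gained, lost

def mcnemar_exact_p(gained, lost):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = gained + lost
    if n == 0:
        return 1.0
    k = min(gained, lost)
    # Under the null, flips are equally likely in either direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

bf16 = {1: False, 2: False, 3: True, 4: False}
q3 = {1: True, 2: True, 3: True, 4: False}
gained, lost = discordant_counts(bf16, q3)
print(gained, lost, mcnemar_exact_p(gained, lost))
```

A replication that found roughly as many items losing a stereotypical label as gaining one would yield a large p-value, which is the outcome that would undercut the claim.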

Figures

Figures reproduced from arXiv: 2605.15208 by Plawan Kumar Rath, Rahul Maliakkal.

Figure 2. Relative error in per-item SRS by bit-width. The monotonic increase …
Figure 3. Unknown Selection Rate (USR) decline under compression. All …
Figure 4. Comparison of effect sizes (Cohen’s h) between all items and latent-bias items (SRS ≥ 0.2 at BF16). The latent-bias subset shows medium-to-large effects, confirming that population-level metrics understate compression’s true impact.
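Figure 4 reports effect sizes as Cohen's h, which compares two proportions after an arcsine transform: h = 2·asin(√p₁) − 2·asin(√p₂). A minimal sketch follows; the proportions passed in are placeholders, not values from the paper, and the 0.2 / 0.5 / 0.8 cut-offs are Cohen's conventional small / medium / large labels.

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def effect_label(h):
    """Conventional reading of |h| (Cohen's rule-of-thumb thresholds)."""
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Placeholder proportions: stereotypical rate after vs. before compression.
h = cohens_h(0.20, 0.05)
print(round(h, 3), effect_label(h))
```

Because the transform stretches differences near 0 and 1, a shift from a low baseline rate can register as a sizeable h even when the raw difference in proportions is modest.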
Original abstract

Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models' willingness to select "unknown" answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.
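The stated total of 911,100 inference records follows directly from the design factors in the abstract (3 models, 5 precision levels, 12,148 items, 5 seeds). A quick check, with variable names chosen for this example:

```python
models = ["Qwen2.5-7B", "Mistral-7B", "Phi-3.5-mini"]
precision_levels = 5   # BF16 through 3-bit
bbq_items = 12_148
seeds = 5

total_records = len(models) * precision_levels * bbq_items * seeds
print(total_records)  # 911100
```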

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that post-training quantization of instruction-tuned LLMs (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at precisions from BF16 to 3-bit induces new stereotypical biases on the BBQ benchmark. Across 12,148 items and 5 seeds (911,100 inferences), 6-21% of previously unbiased items develop stereotypical responses at 3-bit following a logistic-regression-confirmed dose-response, accompanied by a 17.4% drop in 'unknown' selections; these shifts are invisible to perplexity, which rises <0.5% at 8-bit and <3% at 4-bit.

Significance. If the central interpretation holds, the work is significant for demonstrating that standard aggregate metrics miss fairness-critical degradation under compression. Strengths include the multi-model, multi-precision design, large item count with statistical controls, explicit dose-response modeling via logistic regression, and the scale of controlled comparisons that allow distinguishing gradual from threshold effects.

major comments (2)
  1. [§4.2, item-level shift analysis] The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.
  2. [Methods §2.1 and Appendix A] The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.
minor comments (3)
  1. [Abstract] The 6-21% range should be disaggregated by model or reported with per-model confidence intervals to allow readers to assess consistency.
  2. [Figure 2, dose-response plots] Add error bars or shaded regions reflecting the 5 random seeds so that the logistic regression fit can be visually assessed for robustness.
  3. [Table 1] Clarify whether the 'previously unbiased' baseline is computed per model or pooled across models, as this affects the denominator for the 6-21% statistic.
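The disaggregation requested in the first major comment can be sketched as a tally of where items that answered "unknown" at baseline end up after quantization. This is an illustrative sketch only: the label names and the one-label-per-item input format are assumptions, and the toy data are not from the paper.

```python
from collections import Counter

def unknown_exits(baseline, quantized):
    """Tally post-quantization answers for items that were "unknown" at baseline.

    Both arguments map item_id -> answer label
    ("stereotype", "anti", or "unknown"; hypothetical encoding).
    """
    return Counter(
        quantized[i]
        for i, answer in baseline.items()
        if answer == "unknown" and i in quantized
    )

def stereotype_share(exits):
    """Among items that left "unknown", the share that went stereotypical.

    A value near 0.5 is what a non-preferential abstention failure would
    produce; a value well above 0.5 points to directed bias.
    """
    moved = exits["stereotype"] + exits["anti"]
    return exits["stereotype"] / moved if moved else None

bf16 = {1: "unknown", 2: "unknown", 3: "unknown", 4: "unknown", 5: "anti", 6: "unknown"}
q3 = {1: "stereotype", 2: "stereotype", 3: "stereotype", 4: "anti", 5: "stereotype", 6: "unknown"}
exits = unknown_exits(bf16, q3)
print(dict(exits), stereotype_share(exits))
```

The share can then be tested against 0.5 with a binomial test to decide whether the asymmetry exceeds what chance would give.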

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments raise important points about distinguishing bias emergence from general degradation and ensuring reproducibility. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§4.2, item-level shift analysis] The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.

    Authors: We agree that a disaggregation of anti-stereotypical responses is needed to strengthen the attribution to specific bias rather than uniform degradation in instruction following or abstention. Our logistic regression analysis models the dose-response specifically for stereotypical outputs, but we acknowledge the current presentation does not explicitly compare rates of increase across categories. In the revised manuscript we will add this breakdown to §4.2, reporting the relative changes in stereotypical, anti-stereotypical, and unknown selections across precision levels together with statistical tests for differential effects. This addition directly addresses the concern and supports the fairness-critical interpretation.

    Revision: yes

  2. Referee: [Methods §2.1 and Appendix A] The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.

    Authors: The referee is correct that the current description is insufficient for full reproducibility. In the revised manuscript we will expand §2.1 and Appendix A to specify the quantization library and version, the exact calibration dataset, the group size parameter, and an explicit statement that no post-quantization fine-tuning was applied. These details will allow independent replication of the observed threshold effects at 4-bit and 3-bit.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external benchmark

full rationale

This paper reports controlled experiments evaluating three instruction-tuned LLMs at five precision levels on 12,148 BBQ items, directly counting item-level shifts in stereotypical responses and 'unknown' selections across random seeds. No derivations, first-principles results, equations, or predictions are claimed. Logistic regression serves only to statistically confirm the observed dose-response in the measured data rather than generating or substituting for primary results. No self-citations are load-bearing, no parameters are fitted to a subset and then presented as predictions, and all outcomes derive from comparisons against the external BBQ benchmark. The study is self-contained with independent empirical content and no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations and standard statistical practices rather than new free parameters, axioms, or invented entities.

axioms (1)
  • [domain assumption] Standard assumptions of logistic regression and averaging across random seeds apply to the bias measurements and dose-response analysis.
    Invoked to establish the clear dose-response pattern across precision levels.
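To make the dose-response assumption concrete, the sketch below fits a one-predictor logistic regression of "answer is stereotypical" on bit-width using Newton's method and the standard library only. It is a simplified stand-in, not the authors' model specification: the use of raw bit-width as the predictor, the absence of model or item effects, and the synthetic counts are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson. Returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p              # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                 # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic counts (not the paper's data): 100 items per bit-width,
# with the number of stereotypical answers rising as precision drops.
stereotypical_per_100 = {16: 5, 8: 6, 4: 10, 3: 20}
xs, ys = [], []
for bits, k in stereotypical_per_100.items():
    xs += [bits] * 100
    ys += [1] * k + [0] * (100 - k)

b0, b1 = fit_logistic(xs, ys)
print("slope per bit:", b1)
print("P(stereotypical) at 3-bit:", sigmoid(b0 + b1 * 3))
```

A negative slope on bit-width corresponds to the dose-response reading: each bit of precision removed raises the odds of a stereotypical answer. Averaging across seeds before fitting, as the axiom describes, would change the inputs but not the fitting procedure.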


