Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Plawan Kumar Rath; Rahul Maliakkal

arxiv: 2605.15208 · v1 · pith:ZT36ZURYnew · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 17:50 UTC · model grok-4.3

keywords quantization · bias emergence · large language models · model compression · fairness evaluation · BBQ benchmark · precision levels

The pith

Quantization to 3 bits causes 6-21% of previously unbiased BBQ items to develop new stereotypical answers while perplexity barely changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether post-training quantization preserves model behavior beyond standard quality measures. It tracks responses item by item on the BBQ benchmark as precision drops from BF16 to 3 bits in three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini). New stereotypical answers rise steadily as precision falls, a dose-response pattern confirmed via logistic regression, even though perplexity rises by less than 0.5% at 8-bit and under 3% at 4-bit. Fairness failures can therefore appear at precision levels that current evaluation standards still treat as safe. The gap matters because quantized models are widely deployed in cloud and edge settings, where bias can affect real decisions.

Core claim

3-bit quantization causes 6-21% of previously unbiased BBQ items to develop new stereotypical behaviors following a clear dose-response pattern, while models' willingness to select "unknown" answers declines by 17.4%. These item-level shifts remain invisible to perplexity, which rises less than 0.5% at 8-bit and under 3% at 4-bit across models.
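To make the headline statistic concrete, here is a minimal sketch of how an item-level emergence rate could be computed. It assumes each item has already been reduced to one label per precision level (`"stereo"`, `"anti"`, or `"unknown"`) and that "previously unbiased" means "not stereotypical at baseline". Both are assumptions made for illustration; the paper's exact definition and its handling of the five seeds are not given on this page, and the item IDs and labels below are invented.

```python
from typing import Dict

def emergence_rate(baseline: Dict[str, str], quantized: Dict[str, str]) -> float:
    """Share of baseline-unbiased items that answer stereotypically after quantization.

    Assumption: an item counts as unbiased at baseline if its label is not "stereo".
    """
    unbiased_ids = [item for item, label in baseline.items() if label != "stereo"]
    if not unbiased_ids:
        return 0.0
    flipped = sum(1 for item in unbiased_ids if quantized.get(item) == "stereo")
    return flipped / len(unbiased_ids)

# Toy labels (hypothetical): one of four baseline-unbiased items flips.
baseline = {"q1": "unknown", "q2": "unknown", "q3": "stereo", "q4": "anti", "q5": "unknown"}
quantized = {"q1": "stereo", "q2": "unknown", "q3": "stereo", "q4": "anti", "q5": "anti"}

rate = emergence_rate(baseline, quantized)
print(f"new stereotypical answers: {rate:.1%} of previously unbiased items")
```

The denominator matters here: items that were already stereotypical at baseline (such as `q3`) are excluded, so the rate measures emergence rather than overall prevalence.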

What carries the argument

Item-level bias tracking on 12,148 BBQ items across five precision levels and three models, analyzed with logistic regression for dose-response.
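A dose-response check of this kind can be sketched as a one-predictor logistic regression of "item became stereotypical" on bit width. The version below is a rough illustration only: it fits grouped counts with Newton-Raphson using the standard library, the counts are invented, and the paper's actual model specification (covariates, per-model terms, which five precision levels enter the fit) is not stated on this page.

```python
import math
from typing import Sequence, Tuple

def fit_logistic(xs: Sequence[float], successes: Sequence[int],
                 totals: Sequence[int], iters: int = 50) -> Tuple[float, float]:
    """Fit logit(p) = b0 + b1 * x to grouped binomial counts via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, successes, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = s - n * p          # score contribution
            w = n * p * (1.0 - p)      # information weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 Newton system
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical counts: items newly stereotypical out of 1000 at each bit width.
bits = [8, 4, 3]
emerged = [10, 40, 120]
totals = [1000, 1000, 1000]

b0, b1 = fit_logistic(bits, emerged, totals)
odds_ratio_per_bit_removed = math.exp(-b1)
print(f"slope per bit: {b1:.3f}, odds ratio per bit removed: {odds_ratio_per_bit_removed:.2f}")
```

A negative slope on bit width (equivalently, an odds ratio above 1 per bit removed) is what a dose-response pattern would look like in this parameterization.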

If this is right

Even 4-bit models already show new bias in 2.5-5.6% of items despite minimal perplexity shift.
Models become less willing to answer "unknown": the selection rate falls by 17.4% at the lowest precision.
Aggregate quality metrics systematically miss fairness degradation during compression.
Deployment of quantized models requires explicit bias testing before release.
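The page does not say whether the 17.4% decline in "unknown" selections is a relative change or a percentage-point drop, and the two can differ substantially. The sketch below computes both from lists of per-item answers. The function names and toy answers are hypothetical, not taken from the paper.

```python
from typing import Sequence, Tuple

def unknown_selection_rate(answers: Sequence[str]) -> float:
    """Fraction of answers that are the 'unknown' option."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return sum(a == "unknown" for a in answers) / len(answers)

def usr_decline(baseline: Sequence[str], quantized: Sequence[str]) -> Tuple[float, float]:
    """Return (percentage-point drop as a fraction, relative drop) in the unknown rate."""
    base = unknown_selection_rate(baseline)
    quant = unknown_selection_rate(quantized)
    absolute = base - quant
    relative = absolute / base if base > 0 else float("nan")
    return absolute, relative

# Toy data: 8 of 10 'unknown' at baseline, 6 of 10 after quantization.
baseline_answers = ["unknown"] * 8 + ["stereo"] + ["anti"]
quantized_answers = ["unknown"] * 6 + ["stereo"] * 3 + ["anti"]

absolute_drop, relative_drop = usr_decline(baseline_answers, quantized_answers)
print(f"absolute drop: {absolute_drop:.2f}, relative drop: {relative_drop:.2%}")
```

In this toy case a 20-point absolute drop corresponds to a 25% relative drop, which is why the reporting convention is worth stating explicitly.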

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Compression pipelines for production use may need separate bias audits at each precision step.
Other compression methods such as pruning could produce similar hidden fairness shifts.
Post-quantization calibration focused on uncertainty and neutrality might reduce the observed bias increase.

Load-bearing premise

Changes in responses to BBQ items at lower bit widths reflect genuine emergence of stereotypical bias rather than random variation, model degradation, or evaluation noise.

What would settle it

Repeating the controlled runs and finding no statistically significant rise in stereotypical answers at 3-bit or 4-bit precision relative to BF16.

Figures

Figures reproduced from arXiv: 2605.15208 by Plawan Kumar Rath, Rahul Maliakkal.

**Figure 3.** Unknown Selection Rate (USR) decline under compression. All …

**Figure 4.** Comparison of effect sizes (Cohen's h) between all items and latent-bias items (SRS ≥ 0.2 at BF16). The latent-bias subset shows medium-to-large effects, confirming that population-level metrics understate compression's true impact.
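For readers unfamiliar with the effect size named in the Figure 4 caption, Cohen's h compares two proportions after an arcsine transform: h = 2·arcsin(√p1) − 2·arcsin(√p2). The sketch below implements that standard formula along with Cohen's conventional 0.2 / 0.5 / 0.8 cut-offs. The example proportions are illustrative and are not values from the paper.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))

def effect_label(h: float) -> str:
    """Conventional interpretation of |h| (Cohen, 1988)."""
    size = abs(h)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Illustrative proportions only: stereotypical-answer rate at low precision vs. baseline.
h = cohens_h(0.30, 0.10)
print(f"h = {h:.3f} ({effect_label(h)})")
```

Because of the arcsine transform, the same raw difference in proportions yields a larger h near 0 or 1 than near 0.5, which is one reason a subset with low baseline rates can show larger effects than the full population.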

Original abstract

Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models' willingness to select "unknown" answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.
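The record count in the abstract can be checked directly from the stated design, assuming a full factorial layout in which every model is run at every precision level on every item with every seed:

```python
# Design factors as stated in the abstract.
models = 3          # Qwen2.5-7B, Mistral-7B, Phi-3.5-mini
precisions = 5      # BF16 through 3-bit
items = 12_148      # BBQ items
seeds = 5

records = models * precisions * items * seeds
print(f"{records:,} inference records")
```

The product matches the 911,100 records reported, which is consistent with no configurations being dropped.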

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Low-bit quantization introduces new stereotypical responses on BBQ items that perplexity misses, but the drop in 'unknown' answers could reflect general degradation more than targeted bias.

The main thing to know is that 3-bit quantization leads to new stereotypical answers on 6-21% of previously unbiased BBQ items across models, with a clear dose-response, while perplexity stays almost flat. This suggests current checks miss fairness issues in compressed models.

The work stands out for its scale and design: three instruction-tuned models, five precision levels from BF16 to 3-bit, more than 12,000 items, five seeds, and logistic regression to track the pattern. Item-level analysis beats the usual aggregate bias scores, and the multi-model approach makes the finding more general.

The soft spots are around interpretation. The reported 17.4% drop in 'unknown' answers could explain some shifts if models just pick an option more often in ambiguous cases. If the new answers split evenly between stereotypical and anti-stereotypical, that points to general instruction-following degradation rather than bias emergence specifically. The paper claims stereotypical behaviors, so the authors probably checked the polarity, but I'd want to see the exact counts to be sure. Methods details are also thin in the abstract, such as exactly how quantization was applied and what controls were used for other factors.

This paper suits people focused on LLM safety and efficient deployment. Readers evaluating compression techniques or fairness benchmarks will get practical takeaways from the controlled comparisons. It has enough structure and data to merit serious referee time. I would recommend sending it to peer review, asking for more on response distributions and robustness checks.

Referee Report

2 major / 3 minor

Summary. The paper claims that post-training quantization of instruction-tuned LLMs (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at precisions from BF16 to 3-bit induces new stereotypical biases on the BBQ benchmark. Across 12,148 items and 5 seeds (911,100 inferences), 6-21% of previously unbiased items develop stereotypical responses at 3-bit following a logistic-regression-confirmed dose-response, accompanied by a 17.4% drop in 'unknown' selections; these shifts are invisible to perplexity, which rises <0.5% at 8-bit and <3% at 4-bit.

Significance. If the central interpretation holds, the work is significant for demonstrating that standard aggregate metrics miss fairness-critical degradation under compression. Strengths include the multi-model, multi-precision design, large item count with statistical controls, explicit dose-response modeling via logistic regression, and the scale of controlled comparisons that allow distinguishing gradual from threshold effects.

major comments (2)

§4.2 (item-level shift analysis): The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.
Methods §2.1 and Appendix A: The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.
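The first major comment suggests a simple check: among items that abandon "unknown" after quantization, test whether the new answers split evenly between the stereotypical and anti-stereotypical options. One way to sketch this is an exact two-sided binomial test against p = 0.5, shown below with the standard library only. The counts are hypothetical, and this is one possible analysis rather than anything the paper or the referee specifies.

```python
import math

def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under the null proportion p (requires 0 < p < 1).
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")

    def pmf(i: int) -> float:
        # Work in log space so large n does not overflow or underflow.
        log_prob = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        return math.exp(log_prob)

    observed = pmf(k)
    threshold = observed * (1.0 + 1e-7)   # tolerate floating-point ties
    total = sum(q for q in (pmf(i) for i in range(n + 1)) if q <= threshold)
    return min(1.0, total)

# Hypothetical counts: of 200 items that left "unknown", 130 went stereotypical.
new_stereo, new_anti = 130, 70
p_value = binom_two_sided_p(new_stereo, new_stereo + new_anti)
print(f"two-sided p-value for an even split: {p_value:.2e}")
```

A small p-value with more stereotypical than anti-stereotypical answers would support directional bias emergence; a split close to 50/50 would be more consistent with the uniform abstention failure the referee describes.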

minor comments (3)

Abstract: The 6-21% range should be disaggregated by model or reported with per-model confidence intervals to allow readers to assess consistency.
Figure 2 (dose-response plots): Add error bars or shaded regions reflecting the 5 random seeds so that the logistic regression fit can be visually assessed for robustness.
Table 1: Clarify whether the 'previously unbiased' baseline is computed per model or pooled across models, as this affects the denominator for the 6-21% statistic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments raise important points about distinguishing bias emergence from general degradation and ensuring reproducibility. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: §4.2 (item-level shift analysis): The reported 6-21% emergence of new stereotypical behaviors is not accompanied by a breakdown showing whether anti-stereotypical answers increase at a comparable rate. Given the 17.4% decline in 'unknown' selections, a non-preferential default to one of the two named options would be expected to raise both stereotypical and anti-stereotypical counts roughly equally; without this disaggregation or a control for uniform abstention failure, the attribution to specific bias emergence (rather than general instruction-following degradation) is not yet established and directly affects the 'fairness-critical failures' conclusion.

Authors: We agree that a disaggregation of anti-stereotypical responses is needed to strengthen the attribution to specific bias rather than uniform degradation in instruction following or abstention. Our logistic regression analysis models the dose-response specifically for stereotypical outputs, but we acknowledge the current presentation does not explicitly compare rates of increase across categories. In the revised manuscript we will add this breakdown to §4.2, reporting the relative changes in stereotypical, anti-stereotypical, and unknown selections across precision levels together with statistical tests for differential effects. This addition directly addresses the concern and supports the fairness-critical interpretation.

Revision: yes

Referee: Methods §2.1 and Appendix A: The quantization implementation details (library, calibration dataset, group size, and any post-quantization fine-tuning) are described at a high level only. Because the central claim concerns threshold-dependent safety failures at 4-bit and 3-bit, the absence of these parameters prevents independent verification that the observed item-level changes are reproducible rather than artifacts of a particular quantization recipe.

Authors: The referee is correct that the current description is insufficient for full reproducibility. In the revised manuscript we will expand §2.1 and Appendix A to specify the quantization library and version, the exact calibration dataset, the group size parameter, and an explicit statement that no post-quantization fine-tuning was applied. These details will allow independent replication of the observed threshold effects at 4-bit and 3-bit.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external benchmark

This paper reports controlled experiments evaluating three instruction-tuned LLMs at five precision levels on 12,148 BBQ items, directly counting item-level shifts in stereotypical responses and 'unknown' selections across random seeds. No derivations, first-principles results, equations, or predictions are claimed. Logistic regression serves only to statistically confirm the observed dose-response in the measured data rather than generating or substituting for primary results. No self-citations are load-bearing, no parameters are fitted to a subset and then presented as predictions, and all outcomes derive from comparisons against the external BBQ benchmark. The study is self-contained with independent empirical content and no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations and standard statistical practices rather than new free parameters, axioms, or invented entities.

axioms (1)

Domain assumption: Standard assumptions of logistic regression and averaging across random seeds apply to the bias measurements and dose-response analysis.
Invoked to establish the clear dose-response pattern across precision levels.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.induction · unclear
Paper passage: "models' willingness to select 'unknown' answers declines by 17.4%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

