Quality Is Not a Safety Proxy Under Quantization

Sahil Kadadekar

arxiv: 2606.10154 · v1 · pith:3VULLKXEnew · submitted 2026-06-08 · 💻 cs.LG · cs.CR

Quality Is Not a Safety Proxy Under Quantization

Sahil Kadadekar This is my paper

Pith reviewed 2026-06-27 17:21 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords quantizationmodel safetyrefusal ratesquality metricshidden dangerRTSIGGUFAWQ GPTQ

0 comments

The pith

Quality retention does not ensure safety retention under model quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether quality metrics can stand in for direct safety tests on quantized models. It builds a 51-row matrix covering six models, four families, seven GGUF levels, and AWQ/GPTQ INT4 checkpoints. Within this matrix every quality-safety pairing splits direction across models, and nine rows qualify as hidden-danger: quality stays the same or rises while refusal of harmful prompts falls 12 to 68 points. The authors then show that a four-feature screen, the Refusal Template Stability Index, routes all ten hidden- or near-hidden-danger rows to safety testing while keeping most non-baseline rows in a low-risk bucket. The result is that, for the checkpoints and outcomes examined, quality numbers cannot replace safety checks.

Core claim

Across the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation. All 36 quality-safety pairings split direction across models; nine hidden-danger rows and one near-hidden-danger row keep quality stable or improved while refusal drops 12-68 points. Seven of eleven AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic check on 17 FP16/AWQ/GPTQ cells finds entropy, refusal-direction, and calibration probes weak or null, and safety-associated neurons absorb more quantization error overall but not in a regime-specific way. The Refusal Template Stability Index, built from four refusal-template drift features and calibra

What carries the argument

The hidden-danger classification (quality stable or improved while refusal on a predefined harmful-prompt set falls sharply) together with the Refusal Template Stability Index that flags rows for direct safety testing.

If this is right

Direct safety evaluation remains necessary even when quality metrics are retained after quantization.
The RTSI routes every hidden- or near-hidden-danger row to safety testing while keeping most other rows low-risk.
Single-feature baselines recover fewer hidden-danger rows than the calibrated RTSI at matched bucket size.
Cross-stack transfer of the RTSI requires recalibration on the new matrix.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quality-safety divergence could appear in untested quantization methods or larger model scales.
Safety checks might need to move earlier in the quantization pipeline rather than after quality screening.
Judge-model disagreement on refusal labels could shift hidden-danger counts in follow-up studies.

Load-bearing premise

The refusal rate on the paper's predefined harmful-prompt set is treated as a stable proxy for real-world safety behavior.

What would settle it

A new harmful-prompt distribution or different judge model that reverses the hidden-danger labels on the same quantized checkpoints.

Figures

Figures reproduced from arXiv: 2606.10154 by Sahil Kadadekar.

**Figure 2.** Figure 2: Paper pipeline for the completed study dataset. GGUF ladder rows and new AWQ/GPTQ [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: RTSI as a conservative study-internal deployment screen. Scores below 0.10 form the [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: Second-judge robustness on the predefined 11,470-row stratified second-judge set. Claude [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints. In this matrix the shortcut fails: all 36 quality-safety pairings split direction across models, and 9 hidden-danger rows plus 1 near-hidden-danger row show quality stable or improved while refusal falls by 12-68 percentage points. Seven of the 11 AWQ/GPTQ rows are hidden-danger. A four-probe mechanistic follow-up over the 17 Hugging Face-backed FP16/AWQ/GPTQ cells does not rescue it: entropy, refusal-direction, and calibration probes are weak or null separators of dangerous rows, and although probe-identified safety-associated neurons absorb 1.39$\times$ more quantization error overall ($p < 5 \times 10^{-7}$), the effect is not regime-specific. Claude Sonnet 4 relabels 11,470 items in a predefined stratified set, agrees with the primary gemma3:12b judge on 89.9\% of rows ($\kappa = 0.873$, 95\% CI [0.866, 0.881]), and changes 0/10 hidden-danger cells. A calibrated study-internal behavioral screen -- the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix -- routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95\% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation; on the same matrix, the best single-feature baselines (unique-prefix-rate-delta, raw refusal-rate delta) recover 9/10 and 8/10 respectively at matched bucket size, and cross-stack transfer requires recalibration. For the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a concrete counter-example on 51 rows where quality holds but refusal drops under quantization, with RTSI as a fitted screen that routes the danger cases on this matrix.

read the letter

The punchline is that retained quality cannot stand in for direct safety checks on the quantized checkpoints and families tested here. Nine hidden-danger rows plus one near-miss show quality stable or better while refusal falls 12-68 points, and all 36 quality-safety pairings split direction across models. Seven of the eleven AWQ/GPTQ rows land in that hidden-danger group.

The work is useful because it supplies a matched matrix, counts the splits explicitly, and runs a four-probe mechanistic check that finds no clean separator. The RTSI construction, even though fitted, routes all ten critical rows to testing while keeping 23 of 45 others in a low-risk bucket, and the leave-one-out check holds. Judge agreement with Claude Sonnet 4 is reported at 89.9 percent with zero flips on the critical cells.

The soft spots are real but not fatal. RTSI weights and threshold are derived from the same 51 rows, so the perfect routing score is partly in-sample. The refusal labels rest on one predefined harmful-prompt set and one primary judge; the stress-test note is right that a shift in prompt distribution or judge model could reclassify some rows. No code or data are released, which makes independent checks harder. The mechanistic probes are exploratory and do not change the main finding.

This is for teams that deploy quantized models and need to decide when quality metrics are enough. It deserves a serious referee because the counter-example is quantified and the limitation is stated plainly rather than hidden. The central claim holds on the reported matrix even if broader generalization requires more data.

Referee Report

2 major / 2 minor

Summary. This manuscript audits the common practice of screening quantized LLMs first with quality metrics before (or instead of) direct safety tests. Using a matched 51-row matrix across 6 models, 4 families, a 7-level GGUF ladder, and AWQ/GPTQ INT4 checkpoints, it reports that all 36 quality-safety direction pairings split across models, with 9 hidden-danger rows (plus 1 near-hidden) in which quality is stable or improved while refusal rates on a stratified harmful-prompt set drop 12-68 pp. Mechanistic probes (entropy, refusal-direction, calibration) on the 17 HF-backed cells fail to separate dangerous rows, although safety-associated neurons absorb 1.39× more quantization error overall (p < 5×10^{-7}) without regime specificity. A study-internal Refusal Template Stability Index (RTSI) calibrated on the matrix routes 10/10 hidden/near-hidden rows to safety testing (Wilson lower bound 0.72) under both in-sample and leave-one-out validation while leaving 23/45 non-baseline rows low-risk; cross-judge validation with Claude Sonnet 4 yields κ=0.873 and zero flips among the 10 critical cells. The central conclusion is that, for the quantized checkpoints, model families, and safety outcomes studied here, retained quality cannot waive direct safety evaluation.

Significance. If the reported divergence holds under the paper's measurement protocol, the work supplies concrete, statistically supported evidence against a widespread deployment shortcut. Strengths include the matched-matrix design, the p < 5×10^{-7} neuron-error result, the 89.9 % cross-judge agreement with no change to hidden-danger classifications, and the explicit leave-one-out check on RTSI. The finding is scoped to the studied models and prompt set, which appropriately limits over-claim while still challenging current practice. The mechanistic follow-up, although null for separation, usefully documents the limits of those particular probes.

major comments (2)

[Abstract / hidden-danger rows] Abstract and hidden-danger identification paragraph: the 9 hidden-danger rows are defined solely by refusal-rate drops on the paper's fixed stratified harmful-prompt set (scored by gemma3:12b). While the Claude Sonnet 4 relabeling (89.9 % agreement, 0/10 flips) is reassuring, the core quality-safety divergence claim remains sensitive to prompt distribution or judge-model choice; a sensitivity table varying either factor would directly test robustness of the 12-68 pp deltas.
[RTSI description and validation] RTSI calibration and routing paragraph: feature weights and the decision threshold are fitted to the same 51-row matrix used for evaluation. Although row-level leave-one-out validation is reported and the paper notes that cross-stack transfer requires recalibration, the 10/10 routing performance is still partly in-sample; an external held-out model family would strengthen the practical claim that RTSI can reliably triage future checkpoints.

minor comments (2)

[RTSI validation] The Wilson 95 % CI lower bound of 0.72 for the 10/10 routing is stated without the underlying binomial parameters or exact proportion; adding the calculation details (n, k, method) would improve reproducibility.
[Results matrix] Table or matrix presentation: the 51-row matrix would benefit from an explicit column or annotation marking the 9 hidden-danger and 1 near-hidden rows to allow readers to trace the direction-split counts without cross-referencing text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and positive review. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract / hidden-danger rows] Abstract and hidden-danger identification paragraph: the 9 hidden-danger rows are defined solely by refusal-rate drops on the paper's fixed stratified harmful-prompt set (scored by gemma3:12b). While the Claude Sonnet 4 relabeling (89.9 % agreement, 0/10 flips) is reassuring, the core quality-safety divergence claim remains sensitive to prompt distribution or judge-model choice; a sensitivity table varying either factor would directly test robustness of the 12-68 pp deltas.

Authors: The manuscript already includes substantial validation through cross-judge agreement with Claude Sonnet 4 (89.9% agreement, κ=0.873, zero flips on the 10 critical cells). This directly tests judge-model choice. For prompt distribution sensitivity, we agree it would be informative. However, the stratified prompt set is fixed and designed to cover a range of harm types. We will add a brief sensitivity discussion using prompt subsets in the revision to address this. revision: partial
Referee: [RTSI description and validation] RTSI calibration and routing paragraph: feature weights and the decision threshold are fitted to the same 51-row matrix used for evaluation. Although row-level leave-one-out validation is reported and the paper notes that cross-stack transfer requires recalibration, the 10/10 routing performance is still partly in-sample; an external held-out model family would strengthen the practical claim that RTSI can reliably triage future checkpoints.

Authors: We agree that the RTSI is calibrated in-sample. The row-level leave-one-out validation and the explicit statement that cross-stack transfer requires recalibration are already present in the manuscript. While an external held-out model family would provide additional evidence, the current study spans 4 families and 6 models, and the LOO results support the method within this scope. We will clarify the in-sample nature more explicitly in the revision. revision: partial

Circularity Check

1 steps flagged

RTSI routing performance is calibrated on the evaluation matrix

specific steps

fitted input called prediction [Abstract]
"A calibrated study-internal behavioral screen -- the Refusal Template Stability Index (RTSI), built from four refusal-template drift features and calibrated on this matrix -- routes 10/10 hidden- or near-hidden-danger rows to direct safety testing (Wilson 95% CI lower bound 0.72) while leaving 23 of 45 non-baseline rows in a low-risk bucket under both in-sample scoring and row-level leave-one-out validation"

RTSI features and calibration are derived from the same matrix that defines the hidden-danger rows and on which routing performance is measured; the reported 10/10 success is therefore a fitted outcome on the calibration data even after LOO, rather than an independent prediction.

full rationale

The paper's central claim (quality fails as safety proxy due to 9 hidden-danger rows) is a direct empirical observation from the 51-row matrix and does not depend on RTSI. RTSI itself is calibrated on the identical matrix to detect those rows and its 10/10 routing (with LOO) is presented as a result, matching the fitted_input_called_prediction pattern for that auxiliary component. No self-citations, self-definitional equations, or other load-bearing circular steps appear. The core derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central empirical claim rests on the assumption that the chosen refusal prompts and judge model produce a stable safety signal, plus the four hand-chosen drift features that define RTSI. No new physical entities are introduced.

free parameters (1)

RTSI feature weights and decision threshold
The four refusal-template drift features are combined into an index whose weights and cutoff are chosen to achieve 10/10 routing on this matrix.

axioms (1)

domain assumption Refusal rate on the paper's harmful-prompt set is a valid proxy for safety behavior
Invoked when classifying rows as hidden-danger and when calibrating RTSI.

invented entities (1)

Refusal Template Stability Index (RTSI) no independent evidence
purpose: Lightweight behavioral screen to decide whether a quantized checkpoint needs full safety testing
New composite metric built from four drift features and calibrated on the study matrix.

pith-pipeline@v0.9.1-grok · 5928 in / 1562 out tokens · 19169 ms · 2026-06-27T17:21:13.133343+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 9 canonical work pages

[1]

Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024. URLhttps://arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024
[2]

Colin R. Blyth. On Simpson’s paradox and the sure-thing principle.Journal of the American Statistical Association, 67(338):364–366, 1972

1972
[3]

Pappas, Florian Tramèr, et al

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. InAdvances in Neural Information Processing Systems 37 (Datasets ...

work page doi:10.52202/079017-1745 2024
[4]

Towards understanding and improving refusal in compressed models via mechanistic interpretability.arXiv preprint arXiv:2504.04215, 2025

Vishnu Kabir Chhabra and Mohammad Mahdi Khalili. Towards understanding and improving refusal in compressed models via mechanistic interpretability.arXiv preprint arXiv:2504.04215, 2025. URLhttps://arxiv.org/abs/2504.04215. 9

arXiv 2025
[5]

Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

Pith/arXiv arXiv 2018
[6]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems 36,
[7]

URLhttps://proceedings.neurips.cc/paper_files/paper/2023/hash/ 1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html

2023
[8]

How post-training reshapes LLMs: A mechanistic view on knowledge, truthfulness, refusal, and confidence.arXiv preprint arXiv:2504.02904, 2025

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, and Shichang Zhang. How post-training reshapes LLMs: A mechanistic view on knowledge, truthfulness, refusal, and confidence.arXiv preprint arXiv:2504.02904, 2025. URL https://arxiv.org/abs/2504.02904

arXiv 2025
[9]

Hashimoto

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. InAdvances in Neural Information Processing Systems 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 5fc47800e...

2023
[11]

URLhttps://arxiv.org/abs/2404.04475

Pith/arXiv arXiv
[12]

Spurious correlations in reference-free evaluation of text generation

Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. Spurious correlations in reference-free evaluation of text generation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.102. URL https://acla...

work page doi:10.18653/v1/2022.acl-long.102 2022
[13]

Exploiting LLM quantization.arXiv preprint arXiv:2405.18137, 2024

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization.arXiv preprint arXiv:2405.18137, 2024. URL https://arxiv.org/abs/2405.18137

arXiv 2024
[14]

GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022. URLhttps://arxiv.org/abs/2210.17323

Pith/arXiv arXiv 2022
[15]

GGML: Tensor library for machine learning, 2023

Georgi Gerganov. GGML: Tensor library for machine learning, 2023. URL https://github.com/ggerganov/ggml. GitHub repository

2023
[16]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

2021
[17]

Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, and Bo Li

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, and Bo Li. Decoding compressed trust: Scrutinizing the trustworthiness of efficient LLMs under compression. InProceedings of the 41st Intern...

2024
[18]

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023. URL https://arxiv.org/abs/2307.04657. 10

arXiv 2023
[19]

Investigating the impact of quantization methods on the safety and reliability of large language models.arXiv preprint arXiv:2502.15799, 2025

Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Mikhail Bykov, and Evgeny Burnaev. Investigating the impact of quantization methods on the safety and reliability of large language models.arXiv preprint arXiv:2502.15799, 2025. URL https://arxiv.org/abs/2502.15799

arXiv 2025
[20]

Scaling laws for precision

Tanishq Kumar, Mansheej Paul, and Aditi Raghunathan. Scaling laws for precision. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wg1PCg3CUP

2025
[21]

Equivalence tests: A practical primer fort-tests, correlations, and meta-analyses.Social Psychological and Personality Science, 8(4):355–362, 2017

Daniël Lakens. Equivalence tests: A practical primer fort-tests, correlations, and meta-analyses.Social Psychological and Personality Science, 8(4):355–362, 2017. doi: 10.1177/1948550617697177

work page doi:10.1177/1948550617697177 2017
[22]

Li, Satyapriya Krishna, and Himabindu Lakkaraju

Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju. More RLHF, more trust? on the impact of preference alignment on trustworthiness. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=FpiCLJrSW8

2025
[23]

Holistic evaluation of language models

Percy Liang, Rishi Bommasani, Tony Lee, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=iO4LZibEqW

2023
[24]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, 2004. URLhttps://aclanthology.org/W04-1013/

2004
[25]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration.Proceedings of Machine Learning and Systems, 6, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration.Proceedings of Machine Learning and Systems, 6, 2024. URL https://proceedings.mlsys.org/paper_files/paper/2024/hash/ 42a452cbafa9dd64e9...

2024
[26]

T ruthful QA : Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl...

work page doi:10.18653/v1/2022.acl-long.229 2022
[27]

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machi...

2024
[28]

BBQ : A hand-built bias benchmark for question answering

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.findings-acl.1...

work page doi:10.18653/v1/2022.findings-acl.165 2022
[29]

Comment: Understanding Simpson’s paradox.The American Statistician, 68(1): 8–13, 2014

Judea Pearl. Comment: Understanding Simpson’s paradox.The American Statistician, 68(1): 8–13, 2014

2014
[30]

When quantization affects confidence of large language models?Findings of the Association for Computational Linguistics: NAACL 2024, pages 1918–1928, 2024

Irina Proskurina, Luc Brun, Guillaume Metzler, and Julien Velcin. When quantization affects confidence of large language models?Findings of the Association for Computational Linguistics: NAACL 2024, pages 1918–1928, 2024. doi: 10.18653/v1/2024.findings-naacl.124. URLhttps://aclanthology.org/2024.findings-naacl.124/

work page doi:10.18653/v1/2024.findings-naacl.124 2024
[31]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE. 11

2025
[32]

Safetywashing: Do AI safety benchmarks actually measure safety progress? InAdvances in Neural Information Processing Systems 37 (Datasets and Benchmarks Track), 2024

Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI safety benchmarks actually measure safety progress? InAdvances in Neural Information Processing Systems 37 (Datasets and Benchmarks Track), 2024. doi: 10.52202/0790...

work page doi:10.52202/079017-2190 2024
[33]

Measurement to meaning: A validity-centered framework for AI evaluation.arXiv preprint arXiv:2505.10573, 2025

Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, and Sanmi Koyejo. Measurement to meaning: A validity-centered framework for AI evaluation.arXiv preprint arXiv:2505.10573, 2025. URL https://arxiv.org/abs/2505.10573

arXiv 2025
[34]

Edward H. Simpson. The interpretation of interaction in contingency tables.Journal of the Royal Statistical Society: Series B (Methodological), 13(2):238–241, 1951

1951
[35]

Q-realign: Piggybacking realignment on quantization for safe and efficient LLM deployment.arXiv preprint arXiv:2601.08089, 2026

Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, and Geng Yuan. Q-realign: Piggybacking realignment on quantization for safe and efficient LLM deployment.arXiv preprint arXiv:2601.08089, 2026. URL https://arxiv.org/abs/2601.08089

arXiv 2026
[36]

Reliable and efficient amortized model-based evaluation.arXiv preprint arXiv:2503.13335, 2025

Sang Truong, Yuheng Tu, Percy Liang, Bo Li, and Sanmi Koyejo. Reliable and efficient amortized model-based evaluation.arXiv preprint arXiv:2503.13335, 2025. URL https://arxiv.org/abs/2503.13335

arXiv 2025
[37]

Safety-preserving PTQ via contrastive alignment loss.arXiv preprint arXiv:2511.07842, 2025

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, and Nojun Kwak. Safety-preserving PTQ via contrastive alignment loss.arXiv preprint arXiv:2511.07842, 2025. URLhttps://arxiv.org/abs/2511.07842

arXiv 2025
[38]

Assessing the brittleness of safety alignment via pruning and low-rank modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 52588–52...

2024
[39]

Holistic safety and responsibility evaluations of advanced AI models.arXiv preprint arXiv:2404.14068, 2024

Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, Will Hawkins, Lisa Anne Hendricks, Ramona Comanescu, Oscar Chang, Mikel Rodriguez, Jennifer Beroshi, Dawn Bloxwich, Lev Proleev, Jilin Chen, Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, and William Isaac. Holistic safety and responsibility evaluations of adva...

arXiv 2024
[40]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 2023. URL https://proceedings.mlr.p...

2023
[41]

Beyond perplexity: Multi-dimensional safety evaluation of LLM compression

Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, and Vivek Srikumar. Beyond perplexity: Multi-dimensional safety evaluation of LLM compression. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15359–15396. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-emnlp.901. URL https://aclanthology.org/2...

work page doi:10.18653/v1/2024.findings-emnlp.901 2024
[42]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=SkeHuCVFDr

2020
[43]

Gonzalez, and Ion Stoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InAdvances in Neural 12 Information Processing Systems 36 (Datasets and Benchmarks Track), 2023. URL https://proceedings.neu...

2023
[44]

A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12: 1556–1577, 2024

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12: 1556–1577, 2024. doi: 10.1162/tacl_a_00704. URL https://aclanthology.org/2024.tacl-1.85/

work page doi:10.1162/tacl_a_00704 2024
[45]

i cannot fulfill your request

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. URLhttps://arxiv.org/abs/2307.15043. A Ethical Considerations, Data Availability, and Disclosures A.1 Ethical Considerations This paper analyzes quality an...

Pith/arXiv arXiv 2023
[46]

Claims.Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?Answer:Yes.Justification:The abstract and §1 describe the study as a scoped proxy-validity audit over a matched 51-row matrix, and every headline number (36/36 sign splits, 9 hidden-danger + 1 near-hidden row, 7/11 AWQ/GPTQ rows, RTSI low...
[47]

Limitations.Does the paper discuss the limitations of the work performed by the authors? Answer:Yes.Justification:§5 enumerates the 6-model / 4-family / ≤7B coverage limitation, the within-stack conclusion, the judge-family overlap, the RTSI row-level-LOOCV-only calibration, and the benchmark-breadth future work
[48]

Theory assumptions and proofs.For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?Answer:NA.Justification:The paper contains no theoretical theorems; all claims are empirical over the 51-row matrix
[49]

Experimental result reproducibility.Does the paper fully disclose all the information needed to reproduce the main experimental results to the extent that it affects the main claims?Answer:Partial.Justification:§3 documents the matched-matrix construction, task slices, quantization pipeline, and judge configuration, and the reproduction package (via scrip...
[50]

Open access to data and code.Does the paper provide open access to data and code, with sufficient instructions to faithfully reproduce the main experimental results?Answer:Partial. Justification:Aggregated evaluation outputs, second-judge relabellings, mechanistic-probe statistics, and the analysis harness are released in the reproducibility bundle descri...
[51]

Experimental setting/details.Does the paper specify all training and test details (splits, hyperparameters, optimizer, etc.) necessary to understand the results?Answer:Yes. Justification:§3 and §3.1 specify task prompts per cell, the GGUF ladder, AWQ and GPTQ INT4 configurations, decoding settings, and judge prompts; the submission-materials appendix list...
[52]

Experiment statistical significance.Does the paper report error bars or confidence intervals or statistical significance tests?Answer:Yes.Justification:§4 reports second-judge Cohen’s κ with a 95% bootstrap CI, the safety-neuron quantization-error effect withp= 4.89×10 −7, and sign-heterogeneity across all 36 quality–safety pairings; RTSI screening is val...
[53]

Experiments compute resources.Does the paper provide sufficient compute detail to reproduce the experiments?Answer:Yes.Justification:§A.6 in the submission-materials appendix lists the RTX 4080 Laptop + RTX 6000 Ada split, the per-phase GPU-hours (40 + 35 + 15 + 12 GPU-hours), and the approximate $25 Claude Sonnet 4 API cost
[54]

Code of ethics.Does the research conform with the NeurIPS Code of Ethics in every respect?Answer:Yes.Justification:The Ethics subsection in §A.11 documents the use of publicly released harmful-prompt suites without introducing new harmful content, researcher-credit-account API usage, and the absence of human subjects
[55]

safety-passed

Broader impacts.Does the paper discuss both potential positive and negative societal impacts?Answer:Yes.Justification:The Broader Impact subsection in §A.11 names the positive outcome (deployment triage against quality-as-safety-proxy substitution) and the negative risk (RTSI misuse as a general “safety-passed” stamp outside the studied matrix)
[56]

Safeguards.Does the paper describe safeguards for responsible release of high-misuse- risk data or models?Answer:Yes.Justification:§A.11 and the submission-materials appendix state that the reproduction package releases aggregated refusal, truthfulness, and bias-resistance statistics only; no verbatim harmful completions are redistributed, and harmful-pro...
[57]

Licenses for existing assets.Are creators/original owners of used assets properly credited with license and terms of use?Answer:Yes.Justification: refs.bib cites the AdvBench, TruthfulQA, BBQ, BERTScore, ROUGE, AutoAWQ, AutoGPTQ, andllama.cpp/Ollama sources used in the pipeline described in §3; model-card license terms are inherited from the upstream Hugg...
[58]

New assets.Are new assets introduced in the paper well documented?Answer:Yes. Justification:The new assets are the 51-row quality–safety matrix, the RTSI screening heuristic, and the four-probe mechanistic follow-up outputs; all three are described in §3, §4.4, and the submission-materials appendix, and are packaged by the reproducibility bundle builder c...
[59]

Crowdsourcing and human subjects.For crowdsourcing and research with human subjects, does the paper include instructions, screenshots, and compensation details?Answer:NA. Justification:No crowdsourcing or human-subjects data collection was performed; LLM judges (gemma3:12b primary, Claude Sonnet 4 secondary) are methodological infrastructure, not human raters
[60]

IRB approvals.Does the paper describe potential participant risks, disclosure, and IRB (or equivalent) approvals?Answer:NA.Justification:No human subjects were involved, so no IRB review was required, as stated in the Ethics subsection of §A.11. 20
[61]

LLM usage.Does the paper declare LLM usage if it is an important, original, or non- standard component of the core methods?Answer:Yes.Justification:The primary judge gemma3:12b and the Claude Sonnet 4 second judge are core methodological components declared in §3 and §4; agreement,κ, and cell-level regime stability are reported explicitly. 21

[1] [1]

Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024. URLhttps://arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024

[2] [2]

Colin R. Blyth. On Simpson’s paradox and the sure-thing principle.Journal of the American Statistical Association, 67(338):364–366, 1972

1972

[3] [3]

Pappas, Florian Tramèr, et al

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. InAdvances in Neural Information Processing Systems 37 (Datasets ...

work page doi:10.52202/079017-1745 2024

[4] [4]

Towards understanding and improving refusal in compressed models via mechanistic interpretability.arXiv preprint arXiv:2504.04215, 2025

Vishnu Kabir Chhabra and Mohammad Mahdi Khalili. Towards understanding and improving refusal in compressed models via mechanistic interpretability.arXiv preprint arXiv:2504.04215, 2025. URLhttps://arxiv.org/abs/2504.04215. 9

arXiv 2025

[5] [5]

Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

Pith/arXiv arXiv 2018

[6] [6]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems 36,

[7] [7]

URLhttps://proceedings.neurips.cc/paper_files/paper/2023/hash/ 1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html

2023

[8] [8]

How post-training reshapes LLMs: A mechanistic view on knowledge, truthfulness, refusal, and confidence.arXiv preprint arXiv:2504.02904, 2025

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, and Shichang Zhang. How post-training reshapes LLMs: A mechanistic view on knowledge, truthfulness, refusal, and confidence.arXiv preprint arXiv:2504.02904, 2025. URL https://arxiv.org/abs/2504.02904

arXiv 2025

[9] [9]

Hashimoto

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback. InAdvances in Neural Information Processing Systems 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 5fc47800e...

2023

[10] [11]

URLhttps://arxiv.org/abs/2404.04475

Pith/arXiv arXiv

[11] [12]

Spurious correlations in reference-free evaluation of text generation

Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. Spurious correlations in reference-free evaluation of text generation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.102. URL https://acla...

work page doi:10.18653/v1/2022.acl-long.102 2022

[12] [13]

Exploiting LLM quantization.arXiv preprint arXiv:2405.18137, 2024

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization.arXiv preprint arXiv:2405.18137, 2024. URL https://arxiv.org/abs/2405.18137

arXiv 2024

[13] [14]

GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022. URLhttps://arxiv.org/abs/2210.17323

Pith/arXiv arXiv 2022

[14] [15]

GGML: Tensor library for machine learning, 2023

Georgi Gerganov. GGML: Tensor library for machine learning, 2023. URL https://github.com/ggerganov/ggml. GitHub repository

2023

[15] [16]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

2021

[16] [17]

Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, and Bo Li

Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, and Bo Li. Decoding compressed trust: Scrutinizing the trustworthiness of efficient LLMs under compression. InProceedings of the 41st Intern...

2024

[17] [18]

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023. URL https://arxiv.org/abs/2307.04657. 10

arXiv 2023

[18] [19]

Investigating the impact of quantization methods on the safety and reliability of large language models.arXiv preprint arXiv:2502.15799, 2025

Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Mikhail Bykov, and Evgeny Burnaev. Investigating the impact of quantization methods on the safety and reliability of large language models.arXiv preprint arXiv:2502.15799, 2025. URL https://arxiv.org/abs/2502.15799

arXiv 2025

[19] [20]

Scaling laws for precision

Tanishq Kumar, Mansheej Paul, and Aditi Raghunathan. Scaling laws for precision. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wg1PCg3CUP

2025

[20] [21]

Equivalence tests: A practical primer fort-tests, correlations, and meta-analyses.Social Psychological and Personality Science, 8(4):355–362, 2017

Daniël Lakens. Equivalence tests: A practical primer fort-tests, correlations, and meta-analyses.Social Psychological and Personality Science, 8(4):355–362, 2017. doi: 10.1177/1948550617697177

work page doi:10.1177/1948550617697177 2017

[21] [22]

Li, Satyapriya Krishna, and Himabindu Lakkaraju

Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju. More RLHF, more trust? on the impact of preference alignment on trustworthiness. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=FpiCLJrSW8

2025

[22] [23]

Holistic evaluation of language models

Percy Liang, Rishi Bommasani, Tony Lee, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=iO4LZibEqW

2023

[23] [24]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, 2004. URLhttps://aclanthology.org/W04-1013/

2004

[24] [25]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration.Proceedings of Machine Learning and Systems, 6, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration.Proceedings of Machine Learning and Systems, 6, 2024. URL https://proceedings.mlsys.org/paper_files/paper/2024/hash/ 42a452cbafa9dd64e9...

2024

[25] [26]

T ruthful QA : Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl...

work page doi:10.18653/v1/2022.acl-long.229 2022

[26] [27]

HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machi...

2024

[27] [28]

BBQ : A hand-built bias benchmark for question answering

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.findings-acl.1...

work page doi:10.18653/v1/2022.findings-acl.165 2022

[28] [29]

Comment: Understanding Simpson’s paradox.The American Statistician, 68(1): 8–13, 2014

Judea Pearl. Comment: Understanding Simpson’s paradox.The American Statistician, 68(1): 8–13, 2014

2014

[29] [30]

When quantization affects confidence of large language models?Findings of the Association for Computational Linguistics: NAACL 2024, pages 1918–1928, 2024

Irina Proskurina, Luc Brun, Guillaume Metzler, and Julien Velcin. When quantization affects confidence of large language models?Findings of the Association for Computational Linguistics: NAACL 2024, pages 1918–1928, 2024. doi: 10.18653/v1/2024.findings-naacl.124. URLhttps://aclanthology.org/2024.findings-naacl.124/

work page doi:10.18653/v1/2024.findings-naacl.124 2024

[30] [31]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE. 11

2025

[31] [32]

Safetywashing: Do AI safety benchmarks actually measure safety progress? InAdvances in Neural Information Processing Systems 37 (Datasets and Benchmarks Track), 2024

Richard Ren, Steven Basart, Adam Khoja, Alexander Pan, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Gabriel Mukobi, Ryan Hwang Kim, Stephen Fitz, and Dan Hendrycks. Safetywashing: Do AI safety benchmarks actually measure safety progress? InAdvances in Neural Information Processing Systems 37 (Datasets and Benchmarks Track), 2024. doi: 10.52202/0790...

work page doi:10.52202/079017-2190 2024

[32] [33]

Measurement to meaning: A validity-centered framework for AI evaluation.arXiv preprint arXiv:2505.10573, 2025

Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, and Sanmi Koyejo. Measurement to meaning: A validity-centered framework for AI evaluation.arXiv preprint arXiv:2505.10573, 2025. URL https://arxiv.org/abs/2505.10573

arXiv 2025

[33] [34]

Edward H. Simpson. The interpretation of interaction in contingency tables.Journal of the Royal Statistical Society: Series B (Methodological), 13(2):238–241, 1951

1951

[34] [35]

Q-realign: Piggybacking realignment on quantization for safe and efficient LLM deployment.arXiv preprint arXiv:2601.08089, 2026

Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, and Geng Yuan. Q-realign: Piggybacking realignment on quantization for safe and efficient LLM deployment.arXiv preprint arXiv:2601.08089, 2026. URL https://arxiv.org/abs/2601.08089

arXiv 2026

[35] [36]

Reliable and efficient amortized model-based evaluation.arXiv preprint arXiv:2503.13335, 2025

Sang Truong, Yuheng Tu, Percy Liang, Bo Li, and Sanmi Koyejo. Reliable and efficient amortized model-based evaluation.arXiv preprint arXiv:2503.13335, 2025. URL https://arxiv.org/abs/2503.13335

arXiv 2025

[36] [37]

Safety-preserving PTQ via contrastive alignment loss.arXiv preprint arXiv:2511.07842, 2025

Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, and Nojun Kwak. Safety-preserving PTQ via contrastive alignment loss.arXiv preprint arXiv:2511.07842, 2025. URLhttps://arxiv.org/abs/2511.07842

arXiv 2025

[37] [38]

Assessing the brittleness of safety alignment via pruning and low-rank modifications

Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 52588–52...

2024

[38] [39]

Holistic safety and responsibility evaluations of advanced AI models.arXiv preprint arXiv:2404.14068, 2024

Laura Weidinger, Joslyn Barnhart, Jenny Brennan, Christina Butterfield, Susie Young, Will Hawkins, Lisa Anne Hendricks, Ramona Comanescu, Oscar Chang, Mikel Rodriguez, Jennifer Beroshi, Dawn Bloxwich, Lev Proleev, Jilin Chen, Sebastian Farquhar, Lewis Ho, Iason Gabriel, Allan Dafoe, and William Isaac. Holistic safety and responsibility evaluations of adva...

arXiv 2024

[39] [40]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR, 2023. URL https://proceedings.mlr.p...

2023

[40] [41]

Beyond perplexity: Multi-dimensional safety evaluation of LLM compression

Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, and Vivek Srikumar. Beyond perplexity: Multi-dimensional safety evaluation of LLM compression. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15359–15396. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.findings-emnlp.901. URL https://aclanthology.org/2...

work page doi:10.18653/v1/2024.findings-emnlp.901 2024

[41] [42]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. InInternational Conference on Learning Representations, 2020. URLhttps://openreview.net/forum?id=SkeHuCVFDr

2020

[42] [43]

Gonzalez, and Ion Stoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. InAdvances in Neural 12 Information Processing Systems 36 (Datasets and Benchmarks Track), 2023. URL https://proceedings.neu...

2023

[43] [44]

A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12: 1556–1577, 2024

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models.Transactions of the Association for Computational Linguistics, 12: 1556–1577, 2024. doi: 10.1162/tacl_a_00704. URL https://aclanthology.org/2024.tacl-1.85/

work page doi:10.1162/tacl_a_00704 2024

[44] [45]

i cannot fulfill your request

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. URLhttps://arxiv.org/abs/2307.15043. A Ethical Considerations, Data Availability, and Disclosures A.1 Ethical Considerations This paper analyzes quality an...

Pith/arXiv arXiv 2023

[45] [46]

Claims.Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?Answer:Yes.Justification:The abstract and §1 describe the study as a scoped proxy-validity audit over a matched 51-row matrix, and every headline number (36/36 sign splits, 9 hidden-danger + 1 near-hidden row, 7/11 AWQ/GPTQ rows, RTSI low...

[46] [47]

Limitations.Does the paper discuss the limitations of the work performed by the authors? Answer:Yes.Justification:§5 enumerates the 6-model / 4-family / ≤7B coverage limitation, the within-stack conclusion, the judge-family overlap, the RTSI row-level-LOOCV-only calibration, and the benchmark-breadth future work

[47] [48]

Theory assumptions and proofs.For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?Answer:NA.Justification:The paper contains no theoretical theorems; all claims are empirical over the 51-row matrix

[48] [49]

Experimental result reproducibility.Does the paper fully disclose all the information needed to reproduce the main experimental results to the extent that it affects the main claims?Answer:Partial.Justification:§3 documents the matched-matrix construction, task slices, quantization pipeline, and judge configuration, and the reproduction package (via scrip...

[49] [50]

Open access to data and code.Does the paper provide open access to data and code, with sufficient instructions to faithfully reproduce the main experimental results?Answer:Partial. Justification:Aggregated evaluation outputs, second-judge relabellings, mechanistic-probe statistics, and the analysis harness are released in the reproducibility bundle descri...

[50] [51]

Experimental setting/details.Does the paper specify all training and test details (splits, hyperparameters, optimizer, etc.) necessary to understand the results?Answer:Yes. Justification:§3 and §3.1 specify task prompts per cell, the GGUF ladder, AWQ and GPTQ INT4 configurations, decoding settings, and judge prompts; the submission-materials appendix list...

[51] [52]

Experiment statistical significance.Does the paper report error bars or confidence intervals or statistical significance tests?Answer:Yes.Justification:§4 reports second-judge Cohen’s κ with a 95% bootstrap CI, the safety-neuron quantization-error effect withp= 4.89×10 −7, and sign-heterogeneity across all 36 quality–safety pairings; RTSI screening is val...

[52] [53]

Experiments compute resources.Does the paper provide sufficient compute detail to reproduce the experiments?Answer:Yes.Justification:§A.6 in the submission-materials appendix lists the RTX 4080 Laptop + RTX 6000 Ada split, the per-phase GPU-hours (40 + 35 + 15 + 12 GPU-hours), and the approximate $25 Claude Sonnet 4 API cost

[53] [54]

Code of ethics.Does the research conform with the NeurIPS Code of Ethics in every respect?Answer:Yes.Justification:The Ethics subsection in §A.11 documents the use of publicly released harmful-prompt suites without introducing new harmful content, researcher-credit-account API usage, and the absence of human subjects

[54] [55]

safety-passed

Broader impacts.Does the paper discuss both potential positive and negative societal impacts?Answer:Yes.Justification:The Broader Impact subsection in §A.11 names the positive outcome (deployment triage against quality-as-safety-proxy substitution) and the negative risk (RTSI misuse as a general “safety-passed” stamp outside the studied matrix)

[55] [56]

Safeguards.Does the paper describe safeguards for responsible release of high-misuse- risk data or models?Answer:Yes.Justification:§A.11 and the submission-materials appendix state that the reproduction package releases aggregated refusal, truthfulness, and bias-resistance statistics only; no verbatim harmful completions are redistributed, and harmful-pro...

[56] [57]

Licenses for existing assets.Are creators/original owners of used assets properly credited with license and terms of use?Answer:Yes.Justification: refs.bib cites the AdvBench, TruthfulQA, BBQ, BERTScore, ROUGE, AutoAWQ, AutoGPTQ, andllama.cpp/Ollama sources used in the pipeline described in §3; model-card license terms are inherited from the upstream Hugg...

[57] [58]

New assets.Are new assets introduced in the paper well documented?Answer:Yes. Justification:The new assets are the 51-row quality–safety matrix, the RTSI screening heuristic, and the four-probe mechanistic follow-up outputs; all three are described in §3, §4.4, and the submission-materials appendix, and are packaged by the reproducibility bundle builder c...

[58] [59]

Crowdsourcing and human subjects.For crowdsourcing and research with human subjects, does the paper include instructions, screenshots, and compensation details?Answer:NA. Justification:No crowdsourcing or human-subjects data collection was performed; LLM judges (gemma3:12b primary, Claude Sonnet 4 secondary) are methodological infrastructure, not human raters

[59] [60]

IRB approvals.Does the paper describe potential participant risks, disclosure, and IRB (or equivalent) approvals?Answer:NA.Justification:No human subjects were involved, so no IRB review was required, as stated in the Ethics subsection of §A.11. 20

[60] [61]

LLM usage.Does the paper declare LLM usage if it is an important, original, or non- standard component of the core methods?Answer:Yes.Justification:The primary judge gemma3:12b and the Claude Sonnet 4 second judge are core methodological components declared in §3 and §4; agreement,κ, and cell-level regime stability are reported explicitly. 21