Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

Hillary Dawkins; Isar Nejadgholi; Krishnapriya Vishnubhotla; Svetlana Kiritchenko

arxiv: 2606.03648 · v1 · pith:7NQFGQJOnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 09:58 UTC · model grok-4.3

keywords: fine-tuning, LLM safety, capability anchoring, safety evaluation, benchmark dependence, incoherent generations, automated safety judgments

The pith

Safety measurements for fine-tuned LLMs must be anchored to specific capability goals to avoid arbitrary results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Previous studies examined how fine-tuning affects LLM safety in limited and inconsistent experimental setups. The paper claims that tying fine-tuning to a defined capability target is required to eliminate arbitrary choices, support clear conclusions on safety changes, and allow fair comparisons of mitigation approaches. Their evaluation finds that fine-tuned models often produce incoherent text when given safety prompts. Automated safety scoring systems fail to judge those incoherent outputs reliably. Safety findings also shift depending on which benchmark and which automated evaluator is selected.

Core claim

Without anchoring fine-tuning to a specific capability goal, evaluations of safety impacts on large language models remain arbitrary and incomparable. This prevents meaningful conclusions about how fine-tuning changes safety and blocks consistent tests of mitigation methods. When capability is not specified as the fine-tuning target, models generate incoherent responses to safety prompts, automated safety judgments become unreliable on those responses, and overall conclusions vary with the safety benchmark and the safety evaluator chosen.

What carries the argument

Anchoring fine-tuning to a specific capability goal, which provides the fixed reference point needed for consistent safety evaluation and comparison.

If this is right

Fine-tuned models produce incoherent generations in response to safety prompts when capability is not anchored.
Automated safety judgments are unreliable when applied to incoherent model outputs.
Conclusions about the safety effects of fine-tuning depend on the particular safety benchmark and evaluator used.
Capability-anchored fine-tuning enables consistent comparison of different safety mitigation methods.
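
One way to act on the second point is to separate obviously degenerate generations before they reach an automated judge. The following is a minimal sketch under stated assumptions, not the paper's procedure: the repetition heuristic, the function names, and the thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter


def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def is_probably_incoherent(text: str, max_repetition: float = 0.5,
                           min_words: int = 1) -> bool:
    """Hypothetical heuristic: empty or heavily looping outputs are flagged."""
    if len(text.split()) < min_words:
        return True
    return repetition_ratio(text) > max_repetition


def split_for_judging(responses):
    """Return (coherent, flagged) so flagged outputs can be reviewed separately."""
    coherent, flagged = [], []
    for response in responses:
        (flagged if is_probably_incoherent(response) else coherent).append(response)
    return coherent, flagged
```

A repetition check only catches looping text; other kinds of incoherence would need a different signal, so flagged outputs are best reported as their own category rather than silently dropped.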

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit capability targets could be added to existing safety benchmarks to reduce dependence on single evaluators.
The same anchoring principle might apply to other post-training steps such as preference tuning or alignment.
Researchers could test whether capability-anchored fine-tuning reduces the observed incoherence on safety prompts compared with unanchored runs.

Load-bearing premise

The problems of incoherent generations, unreliable automated judgments, and benchmark dependence arise primarily from the lack of capability anchoring in prior experimental designs.

What would settle it

An experiment that anchors fine-tuning to an explicit capability goal and still finds that safety conclusions change with benchmark choice or that automated judgments remain unreliable on the outputs.
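
A check of this kind could be sketched as follows: compute the change in harmfulness rate relative to the base model for every benchmark and evaluator pair, then ask whether all pairs point in the same direction. The function names, the key layout, and the tolerance parameter are assumptions for illustration, not part of the paper.

```python
def safety_deltas(base_rates, tuned_rates):
    """Change in harmfulness rate per (benchmark, evaluator) key, tuned minus base."""
    return {key: tuned_rates[key] - base_rates[key] for key in base_rates}


def conclusion_is_stable(deltas, tolerance=0.0):
    """True if every (benchmark, evaluator) pair gives the same direction of change.

    Changes within +/- tolerance count as "unchanged".
    """
    directions = set()
    for delta in deltas.values():
        if delta > tolerance:
            directions.add("worse")
        elif delta < -tolerance:
            directions.add("better")
        else:
            directions.add("unchanged")
    return len(directions) == 1
```

If `conclusion_is_stable` returns False for a capability-anchored run, the benchmark and evaluator dependence persists despite anchoring, which is the outcome described above.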

Figures

Figures reproduced from arXiv: 2606.03648 by Hillary Dawkins, Isar Nejadgholi, Krishnapriya Vishnubhotla, Svetlana Kiritchenko.

**Figure 1.** Safety and Capability through Epochs: Task accuracy (top) and harmfulness rates (bottom) at each epoch for different models and fine-tuning tasks. Bold lines are the mean values averaged over LoRA hyperparameters; shaded regions indicate standard deviation over these scores, and dashed lines are the corresponding accuracy and harmfulness rates for the base model.

**Figure 2.** LlamaGuard judgments for model responses.

**Figure 3.** Harmfulness rates, measured with LlamaGuard-8B, and Compliance rates, measured using the SORRY-Bench evaluator, for models fine-tuned on the Alpaca15k dataset. Corresponding R-squared values from a linear regression fit are displayed in matching color-codes. A corresponding analysis of harmfulness rates in …

**Figure 4.** Harmfulness rates on different Safety benchmarks, as judged by LlamaGuard-8B, for Llama-8B model.
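
The Figure 3 caption reports R-squared values from a linear regression fit. For readers who want to reproduce that kind of number on their own runs, here is a minimal sketch using only the standard library. The truncated caption does not say which quantities are regressed against each other, so `x` and `y` are left generic; the function name is hypothetical.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R-squared.

    For a single predictor, R-squared equals the squared Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for the fit to be defined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

A low R-squared between two safety metrics computed on the same checkpoints would indicate that the metrics do not track each other linearly, which is one way conclusions can come to depend on the evaluator.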

Original abstract

Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts, and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface important issues that (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper usefully flags incoherence and judge unreliability in fine-tuning safety tests but provides no evidence that capability anchoring solves those problems.

The paper's main point is that prior safety studies on fine-tuned LLMs used scattered setups without tying experiments to clear capability targets, and their own runs turned up three concrete headaches: models spitting out incoherent text on safety prompts, automated safety judges falling apart on that text, and overall conclusions flipping based on which benchmark or evaluator gets picked.

They do a solid job documenting those three issues from a multi-dimensional check that looks at both capability and safety. The observations on incoherent outputs and the resulting judge failures are worth noting because they match what a lot of people running these tests probably see in practice. Calling out benchmark dependence also makes sense as a reminder that single-metric claims can be fragile.

The softer spot is the recommendation to anchor everything to a specific capability goal. The abstract and results describe the problems but do not show that holding capability fixed makes generations more coherent, improves judge reliability, or stabilizes conclusions across benchmarks. Those failures could stem from the fine-tuning data itself, the prompt sets, or limits in current classifiers rather than from unanchored experimental choices. Without a before-and-after comparison that demonstrates the fix, the central argument stays at the level of a reasonable proposal rather than a supported conclusion.

This is the sort of work that would fit a reading group focused on LLM evaluation practices. It is aimed at researchers and practitioners who run or review safety tests on adapted models and need reminders about evaluation pitfalls. It deserves peer review because the surfaced issues are practical and the discussion could encourage more careful experimental design, even if the proposed grounding approach needs stronger evidence to carry the main claim.

Referee Report

2 major / 2 minor

Summary. The paper claims that safety evaluations of fine-tuned LLMs must be grounded in a specific capability goal rather than arbitrary experimental settings. The authors argue this anchoring avoids arbitrary choices, enables meaningful safety conclusions, and supports consistent comparisons of mitigation methods. They present a multi-dimensional evaluation of fine-tuning effects on both capability and safety, surfacing three issues: (1) fine-tuned models produce incoherent generations on safety prompts, (2) automated safety judgments are unreliable on incoherent outputs, and (3) conclusions about fine-tuning effects vary with the choice of safety benchmark and evaluator.

Significance. If the central recommendation holds, the work would promote more controlled, capability-anchored experimental designs in LLM safety research, improving reproducibility and the ability to isolate safety impacts from capability changes. The empirical identification of evaluation pitfalls (incoherence, judgment unreliability, benchmark sensitivity) is a practical contribution that could inform better practices, though the manuscript does not include machine-checked proofs or parameter-free derivations.

major comments (2)

[Abstract and §4 (Results)] The central claim that anchoring fine-tuning to a capability goal is 'essential for avoiding arbitrary empirical choices' and 'allowing us to draw meaningful conclusions' is not supported by a direct test. The experiments document problems in the authors' setup but contain no ablation or controlled comparison demonstrating that these issues (incoherent generations, unreliable judgments, benchmark dependence) diminish or disappear when capability is explicitly held fixed.
[§3 (Methodology)] The multi-dimensional evaluation focuses on capability alongside safety, yet the manuscript provides no quantitative definition or operationalization of 'anchoring to a specific capability goal' (e.g., no target capability metric, loss term, or stopping criterion tied to capability). Without this, it is unclear how the recommended practice would be implemented or verified in future work.

minor comments (2)

[Abstract] The three surfaced issues are listed but not explicitly tied back to specific tables or figures showing the effect sizes or statistical significance of the observed incoherence or judgment disagreements.
Throughout: Clarify whether the reported benchmark dependence arises from differences in prompt distributions, model outputs, or evaluator training data; a short table comparing evaluator agreement rates across benchmarks would help.
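
The agreement table requested in the last comment could be produced along these lines. This is a sketch under assumptions: it takes two evaluators' labels for the same responses on each benchmark and reports raw agreement plus Cohen's kappa. The function names and the input layout are hypothetical, not taken from the paper.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two evaluators give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_observed = agreement_rate(labels_a, labels_b)
    labels = set(labels_a) | set(labels_b)
    p_expected = sum(
        (labels_a.count(label) / n) * (labels_b.count(label) / n)
        for label in labels
    )
    if p_expected == 1.0:
        # Both evaluators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def agreement_table(judgments):
    """judgments maps a benchmark name to (labels_a, labels_b)."""
    return {
        name: {"agreement": agreement_rate(a, b), "kappa": cohens_kappa(a, b)}
        for name, (a, b) in judgments.items()
    }
```

Reporting kappa next to raw agreement matters when one label dominates, since two evaluators that both mark nearly everything "safe" will agree often by chance alone.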

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made.

Referee: [Abstract and §4 (Results)] The central claim that anchoring fine-tuning to a capability goal is 'essential for avoiding arbitrary empirical choices' and 'allowing us to draw meaningful conclusions' is not supported by a direct test. The experiments document problems in the authors' setup but contain no ablation or controlled comparison demonstrating that these issues (incoherent generations, unreliable judgments, benchmark dependence) diminish or disappear when capability is explicitly held fixed.

Authors: We agree that a direct ablation comparing anchored versus unanchored fine-tuning would provide stronger causal evidence. Our current experiments demonstrate that, in the absence of any capability target, fine-tuning produces incoherent outputs, unreliable automated judgments, and benchmark-dependent conclusions. These findings illustrate the practical consequences of unanchored evaluation rather than proving that anchoring eliminates them. We view the recommendation as a methodological argument supported by the observed inconsistencies, not as an empirical claim requiring an ablation. We will revise the abstract and §4 to clarify this distinction and avoid overstating the evidential basis.
Revision: partial

Referee: [§3 (Methodology)] The multi-dimensional evaluation focuses on capability alongside safety, yet the manuscript provides no quantitative definition or operationalization of 'anchoring to a specific capability goal' (e.g., no target capability metric, loss term, or stopping criterion tied to capability). Without this, it is unclear how the recommended practice would be implemented or verified in future work.

Authors: We accept this criticism. The manuscript advocates anchoring but does not supply an explicit implementation. In the revision we will add a subsection in §3 that provides concrete examples: (1) monitoring a capability metric such as MMLU accuracy and halting fine-tuning once a pre-specified threshold is reached, and (2) adding a capability-preserving auxiliary loss that penalizes deviation from the base model's performance on a validation set. These operationalizations will make the proposed practice verifiable.
Revision: yes
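
The first operationalization in the simulated rebuttal, halting once a capability threshold is reached, might look roughly like the sketch below. It is an illustration only: the training and evaluation steps are passed in as caller-supplied callables, and every name here (`finetune_until_capability`, `train_one_epoch`, `eval_capability`, `eval_harmfulness`, `target`) is hypothetical rather than taken from the paper or any library.

```python
from typing import Callable, Dict, List, Optional, Tuple


def finetune_until_capability(
    train_one_epoch: Callable[[], None],
    eval_capability: Callable[[], float],
    eval_harmfulness: Callable[[], float],
    target: float,
    max_epochs: int = 10,
) -> Tuple[Optional[int], List[Dict[str, float]]]:
    """Train epoch by epoch and stop at the first epoch meeting the capability target.

    Returns (stopping_epoch, history). stopping_epoch is None if the target
    was never reached within max_epochs.
    """
    history: List[Dict[str, float]] = []
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        capability = eval_capability()
        harmfulness = eval_harmfulness()
        history.append(
            {"epoch": epoch, "capability": capability, "harmfulness": harmfulness}
        )
        if capability >= target:
            # Safety is reported at the checkpoint that meets the capability goal.
            return epoch, history
    return None, history
```

Recording harmfulness at every epoch, not only at the stopping point, keeps the full trajectory available, so different mitigation methods can be compared at the same capability level.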

Circularity Check

0 steps flagged

No circularity; empirical argument with no derivations or self-referential reductions

The paper advances a position that fine-tuning experiments should anchor to capability goals, supported by the authors' own multi-dimensional evaluations showing issues like incoherent generations and benchmark dependence in unanchored settings. No equations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on described experimental observations rather than reducing to a definitional loop or renamed input. This is a standard non-circular empirical argument; any debate over causal attribution belongs to correctness assessment, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; no free parameters, axioms, or invented entities could be identified from the provided text.
