pith. sign in

arxiv: 2602.11157 · v1 · submitted 2025-12-08 · 💻 cs.CL

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Pith reviewed 2026-05-17 01:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords knowledge distillationmultilingual jailbreakLLM safetyrefusal trainingLoRA fine-tuningjailbreak preventionsafety alignment
0
0 comments X

The pith

Distilling safe refusal responses from a teacher model into open-source LLMs raises their jailbreak success rates in multiple languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether response-based knowledge distillation can transfer jailbreak resistance from a proprietary teacher to smaller open-source models across languages. It finds that standard fine-tuning on the teacher's refusal outputs increases jailbreak success rates by as much as 16.6 percentage points for every student model tested. The effect appears even when the models learn to refuse prompts in the languages used for training. Removing certain nuanced boundary refusals from the data can reduce or reverse the safety loss, though reasoning performance on benchmarks such as GSM8K still declines. The results show that distillation of refusal behavior does not reliably preserve or improve multilingual safety.

Core claim

Standard fine-tuning on the teacher's safe refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points, with divergent generalization to unseen languages during distillation.

What carries the argument

Black-box response-based knowledge distillation via LoRA fine-tuning on roughly 28,000 multilingual jailbreak prompts, transferring refusal outputs from the teacher model.

If this is right

  • All three student models exhibit higher jailbreak success rates after the distillation process.
  • Generalization behavior to languages not seen in training differs across base models.
  • Filtering out boundary refusals from the training data can mitigate or reverse the observed safety decline.
  • Performance on reasoning tasks such as GSM8K drops after the distillation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety training pipelines may need to separate refusal patterns that are helpful from those that degrade generalization.
  • Alternative distillation objectives could be tested that preserve safety signals while avoiding the observed degradation.
  • Extending the approach to additional low-resource languages would help map where generalization breaks down.

Load-bearing premise

The teacher's refusal responses form a clean safety signal that can be transferred through distillation without creating new vulnerabilities in other languages.

What would settle it

Running the MultiJail benchmark on each student model before and after distillation and checking whether Jailbreak Success Rate rises by several percentage points.

Figures

Figures reproduced from arXiv: 2602.11157 by Derek Liu, Haihao Liu, Joshua Franco, Kai Zhang, Max Zhang.

Figure 1
Figure 1. Figure 1: The five-stage pipeline for response-based knowledge distillation. First, multilingual jail [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Teacher model o1-mini’s evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language. (a) Baseline Meta-Llama-3-8B-Instruct (b) LoRA tuned Meta-Llama-3-8B-Instruct [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Baseline (left) and LoRA tuned (right) Meta-Llama-3-8B-Instruct evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language. (a) Baseline Gemma-2-2B-IT (b) LoRA tuned Gemma-2-2B-IT [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Baseline (left) and LoRA tuned (right) Gemma-2-2B-IT evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language. (a) Baseline Qwen3-8B (b) LoRA tuned Qwen3-8B [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Baseline (left) and LoRA tuned (right) Qwen3-8B evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Baseline (left) and LoRA tuned (right) Llama-2-13b-chat-hf evaluation scores on MULTIJAIL [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Baseline (left) and LoRA tuned (right) Gemma-3-12B-IT evaluation scores on MULTIJAIL. Qwen3-14B The quantitative distillation results for Qwen3-14B are presented below in [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Baseline (left) and LoRA tuned (right) Qwen3-14B evaluation scores on MULTIJAIL [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced `boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines response-based knowledge distillation (KD) via LoRA to transfer refusal behaviors from OpenAI o1-mini to three open-source student models (Llama-3-8B-Instruct, Gemma-2-2B-IT, Qwen3-8B) using ~28k multilingual jailbreak prompts from XSafety. It reports that this standard fine-tuning on the teacher's safe refusals counterintuitively raises Jailbreak Success Rate (JSR) on the MultiJail benchmark by up to 16.6 percentage points across models, with divergent generalization to unseen languages. The authors identify nuanced 'boundary' refusals as the primary degradation source and show that their removal can mitigate or reverse the JSR increase, although GSM8K reasoning performance declines persist.

Significance. If the central empirical findings hold after improved controls, the work would be significant for multilingual safety alignment research: it provides concrete evidence that response-based KD on teacher refusals can introduce rather than reduce vulnerabilities in low-resource languages, challenges the assumption of a clean transferable safety signal, and demonstrates a practical mitigation via boundary-refusal filtering. The use of established external benchmarks (MultiJail, XSafety, GSM8K) and parameter-efficient fine-tuning strengthens the empirical grounding and offers a reproducible starting point for future curation of distillation data.

major comments (2)
  1. [Abstract and §4 (Results)] Abstract and §4 (Results): the central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.
  2. [§5 (Ablation study on boundary refusals)] §5 (Ablation study on boundary refusals): the mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.
minor comments (2)
  1. [§3 (Method)] §3 (Method): the description of the ~28,000 XSafety prompts would benefit from an explicit breakdown of language distribution and how the black-box response collection was performed to ensure reproducibility.
  2. [Figure 2 or equivalent results table] Figure 2 or equivalent results table: axis labels and legend entries for the three student models could be clarified to improve readability of the JSR changes before/after boundary-refusal removal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional clarity and controls will strengthen the manuscript's reproducibility and interpretability. We address each major comment below and outline the specific revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4 (Results)] Abstract and §4 (Results): the central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.

    Authors: We agree that these experimental details are necessary for full reproducibility and to support claims about generality. In the revised manuscript we will expand §4 with: (1) the precise train/validation/test splits of the ~28k XSafety prompts, (2) the complete LoRA configuration (rank, alpha, dropout, target modules), training hyperparameters (epochs, learning rate, batch size, optimizer), and (3) statistical significance results (bootstrap 95% CIs and paired tests on JSR deltas). We will also add per-model and per-language tables confirming that the JSR increase holds across all three student models and the languages evaluated in MultiJail. These additions will be presented without changing the reported effect sizes or conclusions. revision: yes

  2. Referee: [§5 (Ablation study on boundary refusals)] §5 (Ablation study on boundary refusals): the mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.

    Authors: We accept that a more rigorous definition and controlled ablation are required. We will add: (1) an explicit operational definition of boundary refusals based on observable linguistic markers (hedging phrases, partial compliance, or conditional refusals), (2) inter-annotator agreement statistics (Cohen’s kappa) from the annotation process, and (3) a new control ablation that filters or rewrites responses to equalize length and stylistic distributions while keeping the prompt set, LoRA rank, alpha, epochs, and all other training hyperparameters identical. This control will help isolate whether the JSR mitigation is driven by the semantic content of the boundary refusals rather than length or style shifts. We will report these results in an expanded §5. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study relying on external benchmarks and standard fine-tuning

full rationale

The paper describes an exploratory empirical investigation of response-based knowledge distillation for multilingual safety alignment. It reports experimental results from LoRA fine-tuning on ~28k prompts from XSafety, evaluated on independent benchmarks (MultiJail, XSafety, GSM8K) without any claimed mathematical derivations, first-principles equations, or self-referential definitions of metrics. Success and degradation are measured via externally defined Jailbreak Success Rate and accuracy scores on held-out test sets. No load-bearing step reduces to a fitted parameter or self-citation chain by construction; all claims rest on observable experimental outcomes rather than internal redefinitions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the proprietary teacher's refusal outputs form a reliable safety target and that the MultiJail benchmark faithfully measures real-world jailbreak risk across languages. No new physical or mathematical entities are introduced.

free parameters (2)
  • LoRA rank and alpha
    Hyperparameters controlling the adaptation matrices during PEFT; values not stated in abstract but required for exact replication.
  • Number of training epochs and learning rate
    Standard fine-tuning hyperparameters that affect how strongly the student copies the teacher's refusal behavior.
axioms (2)
  • domain assumption The teacher's black-box responses constitute a high-quality, language-agnostic safety signal.
    Invoked when the authors treat the o1-mini refusals as the ground-truth target for distillation.
  • domain assumption MultiJail prompts and success criteria generalize to real deployment scenarios.
    The evaluation relies on this benchmark without additional validation mentioned.

pith-pipeline@v0.9.0 · 5565 in / 1504 out tokens · 76433 ms · 2026-05-17T01:21:21.488712+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

  2. Knowledge Distillation Must Account for What It Loses

    cs.LG 2026-04 unverdicted novelty 4.0

    Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 19 internal anchors

  1. [1]

    Bahri, H

    Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. Why knowledge distillation amplifies gender bias and how to mitigate from the perspective of DistilBERT. InProceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pages 266–272, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022...

  2. [2]

    To distill or not to distill: Knowledge transfer undermines safety of LLMs

    Anonymous. To distill or not to distill: Knowledge transfer undermines safety of LLMs. In Submitted to The F ourteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AEKji3PwD9. under review

  3. [3]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  5. [5]

    Cascading adversarial bias from injection to distillation in language models, 2025

    Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, and Alina Oprea. Cascading adversarial bias from injection to distillation in language models, 2025. URL https://arxiv.org/abs/2505.24842

  6. [6]

    Safer or

    Hongyu Chen and Seraphina Goldfarb-Tarrant. Safer or luckier? llms as safety evaluators are not robust to artifacts, 2025. URLhttps://arxiv.org/abs/2503.09347

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

  8. [8]

    Multilingualjailbreakchallengesinlargelanguagemodels

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak chal- lenges in large language models, 2024. URLhttps://arxiv.org/abs/2310.06474

  9. [9]

    Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

  10. [10]

    URLhttps://arxiv.org/abs/2209.07858

  11. [11]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

  12. [12]

    Gemma 3 technical report, 2025

    Gemma Team et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503. 19786

  13. [13]

    A closer look at the limitations of instruction tuning, 2024

    Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, and Dinesh Manocha. A closer look at the limitations of instruction tuning, 2024. URLhttps://arxiv.org/abs/2402.05119

  14. [14]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783. 10

  15. [15]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, 2025. URL https://arxiv.org/abs/ 2411.15594

  16. [16]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.arXiv preprint arXiv:2203.09509, 2022

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection, 2022. URLhttps://arxiv.org/abs/2203.09509

  17. [17]

    What is in

    Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety, 2024. URLhttps://arxiv.org/abs/2404.01099

  18. [18]

    Distilling the knowledge in a neural network,

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,

  19. [19]

    URLhttps://arxiv.org/abs/1503.02531

  20. [20]

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. URL https: //arxiv.org/abs/2305.02301

  21. [21]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  22. [22]

    Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger

    Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks, 2024. URL https://arxiv.org/abs/2311. 12786

  23. [23]

    Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024

    Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, and Raviraj Joshi. Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024. URLhttps://arxiv.org/abs/2411.18571

  24. [24]

    and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and Hassabis, Demis and Clopath, Claudia and Kumaran, Dharshan and Hadsell, Raia , year=

    James Kirkpatrick, Razfvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):35...

  25. [25]

    Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025

    Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, and René Vidal. Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025. URL https://arxiv.org/abs/2502.05223

  26. [26]

    Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Ela- heh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, and Shohreh Kasaei. A comprehensive survey on knowledge distillation,

  27. [27]

    URLhttps://arxiv.org/abs/2503.12067

  28. [28]

    A holistic approach to undesired content detection,

    Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world, 2023. URLhttps://arxiv.org/abs/2208.03274

  29. [29]

    On the benefits of knowledge distillation for adversarial robustness, 2022

    Javier Maroto, Guillermo Ortiz-Jiménez, and Pascal Frossard. On the benefits of knowledge distillation for adversarial robustness, 2022. URLhttps://arxiv.org/abs/2203.07159

  30. [30]

    OpenAI o1-mini: Advancing cost-efficient reasoning

    OpenAI. OpenAI o1-mini: Advancing cost-efficient reasoning. https://openai.com/ index/openai-o1-mini-advancing-cost-efficient-reasoning/ , 2024. Accessed: 2025-10-19

  31. [31]

    Using logprobs

    OpenAI. Using logprobs. https://cookbook.openai.com/examples/using_logprobs,

  32. [32]

    Accessed: 2025-10-19

  33. [33]

    GPT-4 Technical Report

    OpenAI et al. Gpt-4 technical report, 2024. URLhttps://arxiv.org/abs/2303.08774. 11

  34. [34]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  35. [35]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models, 2022. URLhttps://arxiv.org/abs/2202.03286

  36. [36]

    Fine-tuning aligned language models compromises safety, even when users do not intend to!,

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!,

  37. [37]

    URLhttps://arxiv.org/abs/2310.03693

  38. [38]

    The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024

    Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024. URLhttps://arxiv.org/abs/2401.13136

  39. [39]

    URLhttps://arxiv.org/abs/2508.06709.2508.06709

    Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, and Miguel Ballesteros. Play favorites: A statistical method to measure self-bias in llm-as-a-judge, 2025. URLhttps://arxiv.org/abs/2508.06709

  40. [40]

    Yijun Tian, Yikun Han, Xiusi Chen, Wei Wang, and Nitesh V . Chawla. Beyond answers: Transferring reasoning capabilities to smaller llms using multi-teacher knowledge distillation,

  41. [41]

    URLhttps://arxiv.org/abs/2402.04616

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URLhttps://arxiv.org/abs/2302.13971

  43. [43]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288

  44. [44]

    Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language models,

  45. [45]

    URLhttps://arxiv.org/abs/2310.00905

  46. [46]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions, 2023. URLhttps://arxiv.org/abs/2212.10560

  47. [47]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  48. [48]

    Oracle-Guided Program Selection from Large Language Models

    Mingke Yang, Yuqi Chen, Yi Liu, and Ling Shi. Distillseq: A framework for safety alignment testing in large language models using knowledge distillation. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’24, page 578–589. ACM, September 2024. doi: 10.1145/3650212.3680304. URL http://dx.doi. org/10.1145/...

  49. [49]

    Distilling rule-based knowledge into large language models, 2024

    Wenkai Yang, Yankai Lin, Jie Zhou, and Ji-Rong Wen. Distilling rule-based knowledge into large language models, 2024. URLhttps://arxiv.org/abs/2311.08883

  50. [50]

    Bach, and Julia Kreutzer

    Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025. URLhttps://arxiv.org/abs/2505.24119

  51. [51]

    Code-switching red-teaming: Llm evaluation for safety and multilingual understanding, 2025

    Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding, 2025. URLhttps://arxiv.org/abs/2406.15481

  52. [52]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685. 12 A Technical Appendices and Supplementary Material A.1 Definitions of safe...