Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Derek Liu; Haihao Liu; Joshua Franco; Kai Zhang; Max Zhang

arxiv: 2602.11157 · v1 · submitted 2025-12-08 · 💻 cs.CL

Pith reviewed 2026-05-17 01:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords knowledge distillation · multilingual jailbreak · LLM safety · refusal training · LoRA fine-tuning · jailbreak prevention · safety alignment

The pith

Distilling safe refusal responses from a teacher model into open-source LLMs raises their jailbreak success rates in multiple languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether response-based knowledge distillation can transfer jailbreak resistance from a proprietary teacher to smaller open-source models across languages. It finds that standard fine-tuning on the teacher's refusal outputs increases jailbreak success rates for every student model tested, by as much as 16.6 percentage points. The effect appears even when the models learn to refuse prompts in the languages used for training. Removing certain nuanced boundary refusals from the data can reduce or reverse the safety loss, though reasoning performance on benchmarks such as GSM8K still declines. The results show that distillation of refusal behavior does not reliably preserve or improve multilingual safety.

Core claim

Standard fine-tuning on the teacher's safe refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points, with divergent generalization to unseen languages during distillation.

What carries the argument

Black-box response-based knowledge distillation via LoRA fine-tuning on roughly 28,000 multilingual jailbreak prompts, transferring refusal outputs from the teacher model.

If this is right

All three student models exhibit higher jailbreak success rates after the distillation process.
Generalization behavior to languages not seen in training differs across base models.
Filtering out boundary refusals from the training data can mitigate or reverse the observed safety decline.
Performance on reasoning tasks such as GSM8K drops after the distillation step.
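
The filtering step in the third point can be sketched as a simple marker-based filter. This is only an illustration: the paper's operational definition of a boundary refusal is not given on this page, so the marker list `BOUNDARY_MARKERS` and the helper names below are hypothetical stand-ins, not the authors' method.

```python
import re

# Hypothetical hedging / partial-compliance markers (assumption: the paper
# does not publish its list; these are illustrative only).
BOUNDARY_MARKERS = [
    r"\bhowever\b",
    r"\bthat said\b",
    r"\bin general terms\b",
    r"\bfor educational purposes\b",
    r"\bi can(?:not|'t) .* but\b",
]
_PATTERN = re.compile("|".join(BOUNDARY_MARKERS), re.IGNORECASE)


def is_boundary_refusal(response: str) -> bool:
    """Return True if the response contains any hypothetical boundary marker."""
    return _PATTERN.search(response) is not None


def filter_boundary_refusals(pairs):
    """Keep only (prompt, response) pairs whose response has no boundary marker."""
    return [(p, r) for p, r in pairs if not is_boundary_refusal(r)]


if __name__ == "__main__":
    data = [
        ("p1", "I can't help with that."),
        ("p2", "I can't provide instructions, but in general terms locks work by pins."),
    ]
    print(filter_boundary_refusals(data))
```

A keyword filter like this also changes the length and style distribution of the remaining data, which is the confound the referee report raises further down.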

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Safety training pipelines may need to separate refusal patterns that are helpful from those that degrade generalization.
Alternative distillation objectives could be tested that preserve safety signals while avoiding the observed degradation.
Extending the approach to additional low-resource languages would help map where generalization breaks down.

Load-bearing premise

The teacher's refusal responses form a clean safety signal that can be transferred through distillation without creating new vulnerabilities in other languages.

What would settle it

Running the MultiJail benchmark on each student model before and after distillation and checking whether Jailbreak Success Rate rises by several percentage points.
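
As a rough sketch of that check, the snippet below computes JSR from per-model counts of safe, unsafe, and invalid responses (the three categories named in the figure captions) and reports the change in percentage points. It assumes JSR is the share of all responses judged unsafe, with invalid responses kept in the denominator; the paper's exact definition is not stated on this page, and the counts in the usage example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Judged response counts for one model on one evaluation set."""
    safe: int
    unsafe: int
    invalid: int


def jsr(c: Counts) -> float:
    """Jailbreak Success Rate in percent (assumed: unsafe / all responses)."""
    total = c.safe + c.unsafe + c.invalid
    if total == 0:
        raise ValueError("no responses to score")
    return 100.0 * c.unsafe / total


def jsr_delta_pp(before: Counts, after: Counts) -> float:
    """Change in JSR, in percentage points, from baseline to distilled model."""
    return jsr(after) - jsr(before)


if __name__ == "__main__":
    baseline = Counts(safe=90, unsafe=10, invalid=0)   # illustrative numbers
    distilled = Counts(safe=80, unsafe=20, invalid=0)  # illustrative numbers
    print(f"JSR change: {jsr_delta_pp(baseline, distilled):+.1f} pp")
```

A positive delta of several points for each student would match the paper's claim; a delta near zero or negative would contradict it.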

Figures

Figures reproduced from arXiv: 2602.11157 by Derek Liu, Haihao Liu, Joshua Franco, Kai Zhang, Max Zhang.

**Figure 2.** Teacher model o1-mini’s evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language.

**Figure 3.** Baseline (left) and LoRA tuned (right) Meta-Llama-3-8B-Instruct evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language.

**Figure 4.** Baseline (left) and LoRA tuned (right) Gemma-2-2B-IT evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language.

**Figure 5.** Baseline (left) and LoRA tuned (right) Qwen3-8B evaluation scores on MULTIJAIL, showing the number of safe, unsafe, and invalid responses per language.

**Figure 6.** Baseline (left) and LoRA tuned (right) Llama-2-13b-chat-hf evaluation scores on MULTIJAIL.

**Figure 7.** Baseline (left) and LoRA tuned (right) Gemma-3-12B-IT evaluation scores on MULTIJAIL.

**Figure 8.** Baseline (left) and LoRA tuned (right) Qwen3-14B evaluation scores on MULTIJAIL.

Original abstract

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced 'boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Distilling refusal responses from o1-mini into open models raises multilingual jailbreak success rates by up to 16.6 points instead of improving safety.

The main thing to know is that this paper finds response-based knowledge distillation on refusal data from o1-mini actually makes the student models more vulnerable to jailbreaks in non-English languages, with success rates rising by as much as 16.6 percentage points on MultiJail after fine-tuning Llama-3-8B, Gemma-2-2B, and Qwen3-8B via LoRA on 28k XSafety prompts. They also report that filtering out nuanced boundary refusals can mitigate or reverse the drop, though reasoning performance on GSM8K still declines. The work documents divergent generalization to unseen languages depending on the base model.

This is a useful empirical observation because it shows a concrete case where transferring a teacher's safe refusals does not preserve alignment as expected in the multilingual setting. The authors run the experiment across three different students and tie the degradation to a specific type of response, which gives the result some practical weight for anyone working on cross-lingual safety. The central claim rests on the reported JSR numbers and the mitigation result, and the paper is honest that it is exploratory.

The soft spots are mostly in the missing details. The abstract does not spell out the exact data splits, LoRA rank and alpha values, learning rate, or number of epochs, nor does it show whether the 16.6-point increase is consistent across all languages or backed by statistical tests. The boundary-refusal filter also lacks a clear operational definition and a control that holds prompt distribution and response length fixed, so it is possible other factors are contributing.

This paper is for researchers focused on multilingual LLM safety and on how distillation interacts with alignment. A reader who wants to see a documented pitfall in applying response-based KD to refusal data would find the numbers and the suggested fix worth examining. It has enough of a concrete, falsifiable observation to deserve peer review. I would send it out and ask for expanded methods sections plus ablations on the filtering step.

Referee Report

2 major / 2 minor

Summary. The paper examines response-based knowledge distillation (KD) via LoRA to transfer refusal behaviors from OpenAI o1-mini to three open-source student models (Llama-3-8B-Instruct, Gemma-2-2B-IT, Qwen3-8B) using ~28k multilingual jailbreak prompts from XSafety. It reports that this standard fine-tuning on the teacher's safe refusals counterintuitively raises Jailbreak Success Rate (JSR) on the MultiJail benchmark by up to 16.6 percentage points across models, with divergent generalization to unseen languages. The authors identify nuanced 'boundary' refusals as the primary degradation source and show that their removal can mitigate or reverse the JSR increase, although GSM8K reasoning performance declines persist.

Significance. If the central empirical findings hold after improved controls, the work would be significant for multilingual safety alignment research: it provides concrete evidence that response-based KD on teacher refusals can introduce rather than reduce vulnerabilities in low-resource languages, challenges the assumption of a clean transferable safety signal, and demonstrates a practical mitigation via boundary-refusal filtering. The use of established external benchmarks (MultiJail, XSafety, GSM8K) and parameter-efficient fine-tuning strengthens the empirical grounding and offers a reproducible starting point for future curation of distillation data.

major comments (2)

[Abstract and §4 (Results)] The central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.
[§5 (Ablation study on boundary refusals)] The mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.
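
One common inter-annotator agreement metric of the kind requested here is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal stdlib implementation for two annotators; the label names in the usage example are hypothetical, since the paper's annotation scheme is not described on this page.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels: "boundary" vs "plain" refusal.
    ann_1 = ["boundary", "boundary", "plain", "plain"]
    ann_2 = ["boundary", "plain", "plain", "plain"]
    print(round(cohens_kappa(ann_1, ann_2), 3))
```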

minor comments (2)

[§3 (Method)] The description of the ~28,000 XSafety prompts would benefit from an explicit breakdown of language distribution and how the black-box response collection was performed to ensure reproducibility.
[Figure 2 or equivalent results table] Axis labels and legend entries for the three student models could be clarified to improve readability of the JSR changes before/after boundary-refusal removal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional clarity and controls will strengthen the manuscript's reproducibility and interpretability. We address each major comment below and outline the specific revisions we will make.

Referee: [Abstract and §4 (Results)] The central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.

Authors: We agree that these experimental details are necessary for full reproducibility and to support claims about generality. In the revised manuscript we will expand §4 with: (1) the precise train/validation/test splits of the ~28k XSafety prompts, (2) the complete LoRA configuration (rank, alpha, dropout, target modules), training hyperparameters (epochs, learning rate, batch size, optimizer), and (3) statistical significance results (bootstrap 95% CIs and paired tests on JSR deltas). We will also add per-model and per-language tables confirming that the JSR increase holds across all three student models and the languages evaluated in MultiJail. These additions will be presented without changing the reported effect sizes or conclusions.
revision: yes

Referee: [§5 (Ablation study on boundary refusals)] The mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.

Authors: We accept that a more rigorous definition and controlled ablation are required. We will add: (1) an explicit operational definition of boundary refusals based on observable linguistic markers (hedging phrases, partial compliance, or conditional refusals), (2) inter-annotator agreement statistics (Cohen’s kappa) from the annotation process, and (3) a new control ablation that filters or rewrites responses to equalize length and stylistic distributions while keeping the prompt set, LoRA rank, alpha, epochs, and all other training hyperparameters identical. This control will help isolate whether the JSR mitigation is driven by the semantic content of the boundary refusals rather than length or style shifts. We will report these results in an expanded §5.
revision: yes
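
The bootstrap 95% CI on JSR deltas promised in the first response could look roughly like the sketch below. It assumes per-prompt binary outcomes (1 = judged unsafe, 0 = otherwise) for the same prompts before and after distillation, and resamples prompts with replacement while keeping each before/after pair together. The function name, the percentile method, and the resample count are illustrative choices, not details from the paper.

```python
import random


def paired_bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the JSR change, in percentage points.

    before/after: per-prompt 0/1 unsafe flags for the same prompts, same order.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(before)
    # Per-prompt change; resampling these keeps each pair intact.
    diffs = [a - b for a, b in zip(after, before)]
    deltas = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(100.0 * sum(sample) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Illustrative outcomes only, not data from the paper.
    base = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
    tuned = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1]
    print(paired_bootstrap_ci(base, tuned))
```

An interval that excludes zero would suggest the JSR change is not just resampling noise over prompts, although it does not account for variance across training seeds.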

Circularity Check

0 steps flagged

No circularity: purely empirical study relying on external benchmarks and standard fine-tuning

The paper describes an exploratory empirical investigation of response-based knowledge distillation for multilingual safety alignment. It reports experimental results from LoRA fine-tuning on ~28k prompts from XSafety, evaluated on independent benchmarks (MultiJail, GSM8K) without any claimed mathematical derivations, first-principles equations, or self-referential definitions of metrics. Success and degradation are measured via externally defined Jailbreak Success Rate and accuracy scores on held-out test sets. No load-bearing step reduces to a fitted parameter or self-citation chain by construction; all claims rest on observable experimental outcomes rather than internal redefinitions.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the proprietary teacher's refusal outputs form a reliable safety target and that the MultiJail benchmark faithfully measures real-world jailbreak risk across languages. No new physical or mathematical entities are introduced.

free parameters (2)

LoRA rank and alpha
Hyperparameters controlling the adaptation matrices during PEFT; values not stated in abstract but required for exact replication.
Number of training epochs and learning rate
Standard fine-tuning hyperparameters that affect how strongly the student copies the teacher's refusal behavior.
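
For orientation, the standard LoRA formulation (Hu et al., 2021) shows where the rank and alpha enter. The symbols follow that formulation rather than any notation from this paper: $W_0$ is a frozen pretrained weight matrix, $B$ and $A$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is the scaling constant.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

The rank $r$ bounds the capacity of the update, and the ratio $\alpha / r$ scales how strongly it perturbs the frozen weights, so both plausibly affect how closely a student copies the teacher's refusals. The values used in the paper are not given on this page.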

axioms (2)

domain assumption: The teacher's black-box responses constitute a high-quality, language-agnostic safety signal.
Invoked when the authors treat the o1-mini refusals as the ground-truth target for distillation.
domain assumption: MultiJail prompts and success criteria generalize to real deployment scenarios.
The evaluation relies on this benchmark without additional validation mentioned.

pith-pipeline@v0.9.0 · 5565 in / 1504 out tokens · 76433 ms · 2026-05-17T01:21:21.488712+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear

Paper passage: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Knowledge Distillation Must Account for What It Loses
cs.LG 2026-04 unverdicted novelty 4.0

Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

Knowledge Distillation Must Account for What It Loses
cs.LG 2026-04 unverdicted novelty 4.0

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 19 internal anchors

[1]

Bahri, H

Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. Why knowledge distillation amplifies gender bias and how to mitigate from the perspective of DistilBERT. InProceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pages 266–272, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022...

work page doi:10.18653/v1/2022 2022
[2]

To distill or not to distill: Knowledge transfer undermines safety of LLMs

Anonymous. To distill or not to distill: Knowledge transfer undermines safety of LLMs. In Submitted to The F ourteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AEKji3PwD9. under review

work page 2025
[3]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[5]

Cascading adversarial bias from injection to distillation in language models, 2025

Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, and Alina Oprea. Cascading adversarial bias from injection to distillation in language models, 2025. URL https://arxiv.org/abs/2505.24842

work page arXiv 2025
[6]

Safer or

Hongyu Chen and Seraphina Goldfarb-Tarrant. Safer or luckier? llms as safety evaluators are not robust to artifacts, 2025. URLhttps://arxiv.org/abs/2503.09347

work page arXiv 2025
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Multilingualjailbreakchallengesinlargelanguagemodels

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak chal- lenges in large language models, 2024. URLhttps://arxiv.org/abs/2310.06474

work page arXiv 2024
[9]

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

work page
[10]

URLhttps://arxiv.org/abs/2209.07858

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Gemma 3 technical report, 2025

Gemma Team et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503. 19786

work page 2025
[13]

A closer look at the limitations of instruction tuning, 2024

Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, and Dinesh Manocha. A closer look at the limitations of instruction tuning, 2024. URLhttps://arxiv.org/abs/2402.05119

work page arXiv 2024
[14]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, 2025. URL https://arxiv.org/abs/ 2411.15594

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection.arXiv preprint arXiv:2203.09509, 2022

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection, 2022. URLhttps://arxiv.org/abs/2203.09509

work page arXiv 2022
[17]

What is in

Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety, 2024. URLhttps://arxiv.org/abs/2404.01099

work page arXiv 2024
[18]

Distilling the knowledge in a neural network,

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,

work page
[19]

URLhttps://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. URL https: //arxiv.org/abs/2305.02301

work page internal anchor Pith review arXiv 2023
[21]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger

Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks, 2024. URL https://arxiv.org/abs/2311. 12786

work page 2024
[23]

Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024

Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, and Raviraj Joshi. Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024. URLhttps://arxiv.org/abs/2411.18571

work page arXiv 2024
[24]

and Milan, Kieran and Quan, John and Ramalho, Tiago and Grabska-Barwinska, Agnieszka and Hassabis, Demis and Clopath, Claudia and Kumaran, Dharshan and Hadsell, Raia , year=

James Kirkpatrick, Razfvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):35...

work page doi:10.1073/pnas.1611835114 2017
[25]

Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025

Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, and René Vidal. Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025. URL https://arxiv.org/abs/2502.05223

work page arXiv 2025
[26]

Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Ela- heh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, and Shohreh Kasaei. A comprehensive survey on knowledge distillation,

work page
[27]

URLhttps://arxiv.org/abs/2503.12067

work page arXiv
[28]

A holistic approach to undesired content detection,

Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world, 2023. URLhttps://arxiv.org/abs/2208.03274

work page arXiv 2023
[29]

On the benefits of knowledge distillation for adversarial robustness, 2022

Javier Maroto, Guillermo Ortiz-Jiménez, and Pascal Frossard. On the benefits of knowledge distillation for adversarial robustness, 2022. URLhttps://arxiv.org/abs/2203.07159

work page arXiv 2022
[30]

OpenAI o1-mini: Advancing cost-efficient reasoning

OpenAI. OpenAI o1-mini: Advancing cost-efficient reasoning. https://openai.com/ index/openai-o1-mini-advancing-cost-efficient-reasoning/ , 2024. Accessed: 2025-10-19

work page 2024
[31]

Using logprobs

OpenAI. Using logprobs. https://cookbook.openai.com/examples/using_logprobs,

work page
[32]

Accessed: 2025-10-19

work page 2025
[33]

GPT-4 Technical Report

OpenAI et al. Gpt-4 technical report, 2024. URLhttps://arxiv.org/abs/2303.08774. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models, 2022. URLhttps://arxiv.org/abs/2202.03286

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Fine-tuning aligned language models compromises safety, even when users do not intend to!,

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!,

work page
[37]

URLhttps://arxiv.org/abs/2310.03693

work page internal anchor Pith review Pith/arXiv arXiv
[38]

The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024. URLhttps://arxiv.org/abs/2401.13136

work page arXiv 2024
[39]

URLhttps://arxiv.org/abs/2508.06709.2508.06709

Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, and Miguel Ballesteros. Play favorites: A statistical method to measure self-bias in llm-as-a-judge, 2025. URLhttps://arxiv.org/abs/2508.06709

work page arXiv 2025
[40]

Yijun Tian, Yikun Han, Xiusi Chen, Wei Wang, and Nitesh V . Chawla. Beyond answers: Transferring reasoning capabilities to smaller llms using multi-teacher knowledge distillation,

work page
[41]

URLhttps://arxiv.org/abs/2402.04616

work page arXiv
[42]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URLhttps://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language models. URL https://arxiv.org/abs/2310.00905

[46]

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. URL https://arxiv.org/abs/2212.10560

[47]

An Yang et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

[48]

Mingke Yang, Yuqi Chen, Yi Liu, and Ling Shi. Distillseq: A framework for safety alignment testing in large language models using knowledge distillation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’24, page 578–589. ACM, September 2024. doi: 10.1145/3650212.3680304. URL http://dx.doi.org/10.1145/...

[49]

Wenkai Yang, Yankai Lin, Jie Zhou, and Ji-Rong Wen. Distilling rule-based knowledge into large language models, 2024. URL https://arxiv.org/abs/2311.08883

[50]

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025. URL https://arxiv.org/abs/2505.24119

[51]

Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding, 2025. URL https://arxiv.org/abs/2406.15481

[52]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685.

A Technical Appendices and Supplementary Material

A.1 Definitions of safe...
