Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Pith reviewed 2026-05-17 01:21 UTC · model grok-4.3
The pith
Distilling safe refusal responses from a teacher model into open-source LLMs raises their jailbreak success rates in multiple languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard fine-tuning on the teacher's safe refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points, with divergent generalization to unseen languages during distillation.
What carries the argument
Black-box response-based knowledge distillation via LoRA fine-tuning on roughly 28,000 multilingual jailbreak prompts, transferring refusal outputs from the teacher model.
If this is right
- All three student models exhibit higher jailbreak success rates after the distillation process.
- Generalization behavior to languages not seen in training differs across base models.
- Filtering out boundary refusals from the training data can mitigate or reverse the observed safety decline.
- Performance on reasoning tasks such as GSM8K drops after the distillation step.
Where Pith is reading between the lines
- Safety training pipelines may need to separate refusal patterns that are helpful from those that degrade generalization.
- Alternative distillation objectives could be tested that preserve safety signals while avoiding the observed degradation.
- Extending the approach to additional low-resource languages would help map where generalization breaks down.
Load-bearing premise
The teacher's refusal responses form a clean safety signal that can be transferred through distillation without creating new vulnerabilities in other languages.
What would settle it
Running the MultiJail benchmark on each student model before and after distillation and checking whether Jailbreak Success Rate rises by several percentage points.
Figures
read the original abstract
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced `boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines response-based knowledge distillation (KD) via LoRA to transfer refusal behaviors from OpenAI o1-mini to three open-source student models (Llama-3-8B-Instruct, Gemma-2-2B-IT, Qwen3-8B) using ~28k multilingual jailbreak prompts from XSafety. It reports that this standard fine-tuning on the teacher's safe refusals counterintuitively raises Jailbreak Success Rate (JSR) on the MultiJail benchmark by up to 16.6 percentage points across models, with divergent generalization to unseen languages. The authors identify nuanced 'boundary' refusals as the primary degradation source and show that their removal can mitigate or reverse the JSR increase, although GSM8K reasoning performance declines persist.
Significance. If the central empirical findings hold after improved controls, the work would be significant for multilingual safety alignment research: it provides concrete evidence that response-based KD on teacher refusals can introduce rather than reduce vulnerabilities in low-resource languages, challenges the assumption of a clean transferable safety signal, and demonstrates a practical mitigation via boundary-refusal filtering. The use of established external benchmarks (MultiJail, XSafety, GSM8K) and parameter-efficient fine-tuning strengthens the empirical grounding and offers a reproducible starting point for future curation of distillation data.
major comments (2)
- [Abstract and §4 (Results)] Abstract and §4 (Results): the central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.
- [§5 (Ablation study on boundary refusals)] §5 (Ablation study on boundary refusals): the mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.
minor comments (2)
- [§3 (Method)] §3 (Method): the description of the ~28,000 XSafety prompts would benefit from an explicit breakdown of language distribution and how the black-box response collection was performed to ensure reproducibility.
- [Figure 2 or equivalent results table] Figure 2 or equivalent results table: axis labels and legend entries for the three student models could be clarified to improve readability of the JSR changes before/after boundary-refusal removal.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional clarity and controls will strengthen the manuscript's reproducibility and interpretability. We address each major comment below and outline the specific revisions we will make.
read point-by-point responses
-
Referee: [Abstract and §4 (Results)] Abstract and §4 (Results): the central claim of a 16.6 pp JSR increase is supported by concrete MultiJail numbers, but the manuscript provides limited detail on exact data splits, hyperparameter choices (LoRA rank/alpha, epochs, learning rate), statistical significance testing, and whether the increase is consistent across all three student models and all languages. These omissions are load-bearing for interpreting the magnitude and generality of the safety degradation.
Authors: We agree that these experimental details are necessary for full reproducibility and to support claims about generality. In the revised manuscript we will expand §4 with: (1) the precise train/validation/test splits of the ~28k XSafety prompts, (2) the complete LoRA configuration (rank, alpha, dropout, target modules), training hyperparameters (epochs, learning rate, batch size, optimizer), and (3) statistical significance results (bootstrap 95% CIs and paired tests on JSR deltas). We will also add per-model and per-language tables confirming that the JSR increase holds across all three student models and the languages evaluated in MultiJail. These additions will be presented without changing the reported effect sizes or conclusions. revision: yes
-
Referee: [§5 (Ablation study on boundary refusals)] §5 (Ablation study on boundary refusals): the mitigation achieved by removing nuanced 'boundary' refusals is presented as evidence that they are the primary source of degradation, yet the manuscript lacks an explicit operational definition of boundary refusals, inter-annotator agreement metrics, or a control ablation that isolates their effect while holding the prompt set, LoRA rank, and training hyperparameters fixed. If boundary refusals are identified by hedging language or partial compliance, their removal may also change response length or stylistic features, confounding attribution of the JSR rise to o1-mini response content rather than fine-tuning artifacts or distribution shift.
Authors: We accept that a more rigorous definition and controlled ablation are required. We will add: (1) an explicit operational definition of boundary refusals based on observable linguistic markers (hedging phrases, partial compliance, or conditional refusals), (2) inter-annotator agreement statistics (Cohen’s kappa) from the annotation process, and (3) a new control ablation that filters or rewrites responses to equalize length and stylistic distributions while keeping the prompt set, LoRA rank, alpha, epochs, and all other training hyperparameters identical. This control will help isolate whether the JSR mitigation is driven by the semantic content of the boundary refusals rather than length or style shifts. We will report these results in an expanded §5. revision: yes
Circularity Check
No circularity: purely empirical study relying on external benchmarks and standard fine-tuning
full rationale
The paper describes an exploratory empirical investigation of response-based knowledge distillation for multilingual safety alignment. It reports experimental results from LoRA fine-tuning on ~28k prompts from XSafety, evaluated on independent benchmarks (MultiJail, XSafety, GSM8K) without any claimed mathematical derivations, first-principles equations, or self-referential definitions of metrics. Success and degradation are measured via externally defined Jailbreak Success Rate and accuracy scores on held-out test sets. No load-bearing step reduces to a fitted parameter or self-citation chain by construction; all claims rest on observable experimental outcomes rather than internal redefinitions.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and alpha
- Number of training epochs and learning rate
axioms (2)
- domain assumption The teacher's black-box responses constitute a high-quality, language-agnostic safety signal.
- domain assumption MultiJail prompts and success criteria generalize to real deployment scenarios.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Knowledge Distillation Must Account for What It Loses
Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.
-
Knowledge Distillation Must Account for What It Loses
Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
Reference graph
Works this paper leans on
-
[1]
Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. Why knowledge distillation amplifies gender bias and how to mitigate from the perspective of DistilBERT. InProceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), pages 266–272, Seattle, Washington, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022...
-
[2]
To distill or not to distill: Knowledge transfer undermines safety of LLMs
Anonymous. To distill or not to distill: Knowledge transfer undermines safety of LLMs. In Submitted to The F ourteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=AEKji3PwD9. under review
work page 2025
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[5]
Cascading adversarial bias from injection to distillation in language models, 2025
Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, and Alina Oprea. Cascading adversarial bias from injection to distillation in language models, 2025. URL https://arxiv.org/abs/2505.24842
- [6]
-
[7]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Multilingualjailbreakchallengesinlargelanguagemodels
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak chal- lenges in large language models, 2024. URLhttps://arxiv.org/abs/2310.06474
-
[9]
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...
-
[10]
URLhttps://arxiv.org/abs/2209.07858
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Gemma 3 technical report, 2025
Gemma Team et al. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503. 19786
work page 2025
-
[13]
A closer look at the limitations of instruction tuning, 2024
Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, and Dinesh Manocha. A closer look at the limitations of instruction tuning, 2024. URLhttps://arxiv.org/abs/2402.05119
-
[14]
Aaron Grattafiori et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/ 2407.21783. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, 2025. URL https://arxiv.org/abs/ 2411.15594
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection, 2022. URLhttps://arxiv.org/abs/2203.09509
-
[17]
Luxi He, Mengzhou Xia, and Peter Henderson. What is in your safe data? identifying benign data that breaks safety, 2024. URLhttps://arxiv.org/abs/2404.01099
-
[18]
Distilling the knowledge in a neural network,
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,
-
[19]
URLhttps://arxiv.org/abs/1503.02531
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, 2023. URL https: //arxiv.org/abs/2305.02301
work page internal anchor Pith review arXiv 2023
-
[21]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[22]
Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks, 2024. URL https://arxiv.org/abs/2311. 12786
work page 2024
-
[23]
Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024
Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, and Raviraj Joshi. Chal- lenges in adapting multilingual llms to low-resource languages using lora peft tuning, 2024. URLhttps://arxiv.org/abs/2411.18571
-
[24]
James Kirkpatrick, Razfvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catas- trophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):35...
-
[25]
Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025
Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, and René Vidal. Kda: A knowledge-distilled attacker for generating diverse prompts to jailbreak llms, 2025. URL https://arxiv.org/abs/2502.05223
-
[26]
Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Ela- heh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, and Shohreh Kasaei. A comprehensive survey on knowledge distillation,
- [27]
-
[28]
A holistic approach to undesired content detection,
Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world, 2023. URLhttps://arxiv.org/abs/2208.03274
-
[29]
On the benefits of knowledge distillation for adversarial robustness, 2022
Javier Maroto, Guillermo Ortiz-Jiménez, and Pascal Frossard. On the benefits of knowledge distillation for adversarial robustness, 2022. URLhttps://arxiv.org/abs/2203.07159
-
[30]
OpenAI o1-mini: Advancing cost-efficient reasoning
OpenAI. OpenAI o1-mini: Advancing cost-efficient reasoning. https://openai.com/ index/openai-o1-mini-advancing-cost-efficient-reasoning/ , 2024. Accessed: 2025-10-19
work page 2024
-
[31]
OpenAI. Using logprobs. https://cookbook.openai.com/examples/using_logprobs,
-
[32]
Accessed: 2025-10-19
work page 2025
-
[33]
OpenAI et al. Gpt-4 technical report, 2024. URLhttps://arxiv.org/abs/2303.08774. 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models, 2022. URLhttps://arxiv.org/abs/2202.03286
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
Fine-tuning aligned language models compromises safety, even when users do not intend to!,
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!,
-
[37]
URLhttps://arxiv.org/abs/2310.03693
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual contexts, 2024. URLhttps://arxiv.org/abs/2401.13136
-
[39]
URLhttps://arxiv.org/abs/2508.06709.2508.06709
Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, and Miguel Ballesteros. Play favorites: A statistical method to measure self-bias in llm-as-a-judge, 2025. URLhttps://arxiv.org/abs/2508.06709
-
[40]
Yijun Tian, Yikun Han, Xiusi Chen, Wei Wang, and Nitesh V . Chawla. Beyond answers: Transferring reasoning capabilities to smaller llms using multi-teacher knowledge distillation,
- [41]
-
[42]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. URLhttps://arxiv.org/abs/2302.13971
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language models,
- [45]
-
[46]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instruc- tions, 2023. URLhttps://arxiv.org/abs/2212.10560
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
An Yang et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Oracle-Guided Program Selection from Large Language Models
Mingke Yang, Yuqi Chen, Yi Liu, and Ling Shi. Distillseq: A framework for safety alignment testing in large language models using knowledge distillation. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’24, page 578–589. ACM, September 2024. doi: 10.1145/3650212.3680304. URL http://dx.doi. org/10.1145/...
-
[49]
Distilling rule-based knowledge into large language models, 2024
Wenkai Yang, Yankai Lin, Jie Zhou, and Ji-Rong Wen. Distilling rule-based knowledge into large language models, 2024. URLhttps://arxiv.org/abs/2311.08883
-
[50]
Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it, 2025. URLhttps://arxiv.org/abs/2505.24119
-
[51]
Code-switching red-teaming: Llm evaluation for safety and multilingual understanding, 2025
Haneul Yoo, Yongjin Yang, and Hwaran Lee. Code-switching red-teaming: Llm evaluation for safety and multilingual understanding, 2025. URLhttps://arxiv.org/abs/2406.15481
-
[52]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685. 12 A Technical Appendices and Supplementary Material A.1 Definitions of safe...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.