Recognition: 2 theorem links
Multilingual Safety Alignment via Self-Distillation
Pith reviewed 2026-05-11 01:03 UTC · model grok-4.3
The pith
Self-distillation transfers an LLM's safety capabilities from high-resource languages to low-resource ones using only input queries and no response data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Multilingual Self-Distillation framework transfers an LLM's inherent safety capabilities from high-resource languages such as English to low-resource languages such as Javanese by operating solely on multilingual queries, without any response data or explicit safety labels in the target languages. Two concrete implementations (on-policy MSD and off-policy MSD) enable the transfer, and Dual-Perspective Safety Weighting optimizes the objective by adaptively increasing weights on safety-critical tokens based on divergence between teacher and student views. The resulting models achieve superior safety on diverse multilingual jailbreak benchmarks, generalize to more challenging datasets and unseen languages, and preserve general capabilities on utility benchmarks.
What carries the argument
Multilingual Self-Distillation (MSD) framework with Dual-Perspective Safety Weighting (DPSW) that performs cross-lingual safety transfer by distilling from the model's own outputs on queries alone and reweighting tokens according to joint teacher-student divergence.
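The theorem-link excerpt later on this page quotes the weighting in flattened form (w^T_t = 1 - top-K entropy, w^S_t = 1 - p^S(y*_t), L(θ) = E[w̃_t · D(p^S ‖ p^T)]). A minimal PyTorch sketch of how such a dual-perspective weighting could be wired up, assuming the two weights combine multiplicatively and that y*_t is the teacher's argmax token (neither choice is confirmed by the excerpt):

```python
import torch
import torch.nn.functional as F

def dpsw_distill_loss(student_logits, teacher_logits, top_k=10):
    """Sketch of a dual-perspective weighted distillation loss.

    Shapes: (batch, seq_len, vocab_size). `top_k` and the multiplicative
    combination of the two weights are assumptions, not the paper's spec.
    """
    p_t = F.softmax(teacher_logits, dim=-1)         # teacher distribution p^T
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_s = log_p_s.exp()                             # student distribution p^S

    # Teacher perspective, w^T_t = 1 - (top-K entropy): a confident,
    # low-entropy teacher marks the token as safety-critical.
    topk_p, _ = p_t.topk(top_k, dim=-1)
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)
    entropy = -(topk_p * topk_p.clamp_min(1e-9).log()).sum(dim=-1)
    entropy = entropy / torch.log(torch.tensor(float(top_k)))  # scale to [0, 1]
    w_teacher = 1.0 - entropy

    # Student perspective, w^S_t = 1 - p^S(y*_t): y*_t is taken here to be
    # the teacher's argmax token (an assumption); a student that assigns it
    # low probability gets a large weight.
    y_star = p_t.argmax(dim=-1)
    w_student = 1.0 - p_s.gather(-1, y_star.unsqueeze(-1)).squeeze(-1)

    # Combine both views, stop gradients through the weights, and reweight
    # the token-level divergence D(p^S || p^T) from the excerpt.
    w = (w_teacher * w_student).detach()
    d = (p_s * (log_p_s - p_t.clamp_min(1e-9).log())).sum(dim=-1)
    return (w * d).mean()
```

The detached weights mean the gradient flows only through the divergence term, which matches the excerpt's framing of the weights as a penalty schedule rather than a learned quantity; that reading is an inference, not something the page confirms.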
If this is right
- Safety alignment for additional languages requires only queries rather than costly generation of labeled responses in each language.
- The approach extends robustness to harder jailbreak datasets and languages absent from the training queries.
- General model capabilities on utility benchmarks stay intact after the safety transfer step.
- The framework integrates with multiple existing self-distillation strategies without modification to the core transfer logic.
Where Pith is reading between the lines
- The same query-only transfer could be tested on other model properties such as factual consistency or reduced bias across languages.
- Deployment in regions using low-resource languages becomes more practical because data collection costs drop.
- Combining the method with light supervised fine-tuning on a few high-resource examples might further improve results in extremely low-resource settings.
Load-bearing premise
An LLM's safety capabilities present in high-resource languages transfer to low-resource languages through self-distillation on queries without any response data or safety labels in the target languages.
What would settle it
Apply the method to a fresh low-resource language never seen in training or evaluation and measure whether jailbreak attack success rate on a new set of adversarial prompts in that language drops substantially below the baseline model's rate.
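Operationally, that test reduces to comparing attack success rate (ASR) before and after MSD training on adversarial prompts in the held-out language. A minimal sketch, assuming a keyword-based refusal heuristic in place of a proper multilingual safety judge (the marker list and `generate` interface are illustrative, not the paper's harness):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")  # crude stand-in

def attack_success_rate(generate, prompts):
    """Fraction of adversarial prompts that elicit a non-refusal.

    `generate` is any prompt -> response callable; `prompts` are adversarial
    queries in the held-out low-resource language. The English keyword
    heuristic stands in for a real multilingual safety classifier.
    """
    hits = sum(
        1 for p in prompts
        if not any(m in generate(p).lower() for m in REFUSAL_MARKERS)
    )
    return hits / len(prompts)

# The proposed test: ASR on the unseen language should drop well below the
# baseline after MSD training, e.g.
#   attack_success_rate(msd_model, prompts) << attack_success_rate(base_model, prompts)
```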
Original abstract
Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.
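The abstract distinguishes on-policy from off-policy MSD without spelling out the updates. In standard distillation terms the variants differ in who generates the sequence the loss is computed on; a hedged sketch of that distinction follows, where the `generate`/`logits` interfaces and the use of parallel high/low-resource queries are assumptions, not the paper's algorithm:

```python
def off_policy_msd_step(teacher, student, query_hi, query_lo, loss_fn):
    # Off-policy: the sequence comes from the teacher (the same LLM,
    # prompted in the high-resource language, where its safeguards hold);
    # the student is trained to match the teacher on that fixed sequence.
    seq = teacher.generate(query_hi)
    return loss_fn(student.logits(query_lo, seq), teacher.logits(query_hi, seq))

def on_policy_msd_step(teacher, student, query_hi, query_lo, loss_fn):
    # On-policy: the sequence comes from the student's own (possibly
    # unsafe) distribution on the low-resource query, so training visits
    # the states the student actually reaches; the teacher scores them.
    seq = student.generate(query_lo)
    return loss_fn(student.logits(query_lo, seq), teacher.logits(query_hi, seq))
```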
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multilingual Self-Distillation (MSD), a cross-lingual safety transfer framework that uses only multilingual queries (no response data or labels) to move inherent safety capabilities from high-resource languages (e.g., English) to low-resource ones (e.g., Javanese) in LLMs. It introduces on-policy and off-policy variants of MSD along with Dual-Perspective Safety Weighting (DPSW), a divergence measure that adaptively weights safety-critical tokens by considering both teacher and student perspectives. Experiments on representative LLMs across multilingual jailbreak and utility benchmarks are claimed to show consistent superiority, effective generalization to harder datasets and unseen languages, and preservation of general capabilities.
Significance. If the results and mechanism hold, the work would be significant for practical LLM deployment: it removes the need for expensive, high-quality response data in every target language, offering a scalable path to multilingual safety alignment while maintaining utility. The approach is flexible across self-distillation strategies and directly targets a documented weakness in current LLMs.
major comments (2)
- [Abstract and §3] Abstract and §3 (MSD and DPSW description): The central transfer claim requires that DPSW can reliably identify and penalize safety-critical tokens cross-lingually using only query inputs and the student model's own (potentially misaligned) outputs. The manuscript does not provide a concrete account of the cross-lingual safety signal (e.g., via shared representations, English logits, or internal activations) that would allow the student perspective to supply useful supervision when the student is already vulnerable on low-resource inputs; without this, gains could be explained by regularization or query exposure alone.
- [§5] §5 (Experiments): The claim of consistent superiority and generalization to unseen languages is load-bearing, yet the reported results lack sufficient detail on data splits, exact baseline implementations, quantitative metrics per language, and controls for query-only effects. This prevents assessment of whether the observed improvements are attributable to active safety transfer via DPSW rather than incidental factors.
minor comments (2)
- [§3] Notation for on-policy vs. off-policy MSD variants should be introduced with explicit equations or pseudocode in §3 to clarify the difference in how the student is updated.
- [Abstract] The abstract states 'preserving the model's general capabilities' but does not specify which utility benchmarks were used or report the magnitude of any degradation; a table summarizing both safety and utility deltas would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness.
Point-by-point responses
- Referee: [Abstract and §3] The central transfer claim requires that DPSW can reliably identify and penalize safety-critical tokens cross-lingually using only query inputs and the student model's own (potentially misaligned) outputs. The manuscript does not provide a concrete account of the cross-lingual safety signal (e.g., via shared representations, English logits, or internal activations) that would allow the student perspective to supply useful supervision when the student is already vulnerable on low-resource inputs; without this, gains could be explained by regularization or query exposure alone.
Authors: We agree that a more explicit mechanistic account would strengthen the presentation. The cross-lingual signal originates from the teacher (the same LLM operating with its inherent high-resource safety alignment) applied to parallel multilingual queries; because the queries share semantic content across languages, the teacher's safer output distributions provide the supervisory signal. DPSW then computes token-level divergences from both teacher and student perspectives to up-weight safety-critical tokens where the student deviates. We will add a dedicated paragraph in §3 clarifying this process, including the role of shared model parameters and representations that enable transfer without language-specific responses. We will also include an ablation isolating query exposure to address alternative explanations.
Revision: yes
- Referee: [§5] The claim of consistent superiority and generalization to unseen languages is load-bearing, yet the reported results lack sufficient detail on data splits, exact baseline implementations, quantitative metrics per language, and controls for query-only effects. This prevents assessment of whether the observed improvements are attributable to active safety transfer via DPSW rather than incidental factors.
Authors: We concur that greater experimental transparency is needed. In the revision we will expand §5 and the appendix with: (i) precise descriptions of all data splits and query sourcing; (ii) full implementation details for every baseline, including query-only controls; (iii) per-language numerical results in main tables rather than aggregated figures; and (iv) additional ablations that directly compare MSD against regularization and query-exposure-only variants. These changes will allow readers to isolate the contribution of DPSW-driven safety transfer.
Revision: yes
Circularity Check
No circularity: empirical transfer via query-only self-distillation
Full rationale
The paper introduces Multilingual Self-Distillation (MSD) as an empirical training procedure that optimizes a distillation loss on multilingual queries alone, using the proposed DPSW divergence to reweight tokens. The central result—that safety transfers from high-resource to low-resource languages—is obtained by running the optimization and then measuring performance on held-out jailbreak and utility benchmarks; nothing in the method or evaluation reduces the reported gains to a definitional identity or to a parameter fitted directly to the target metric. No self-citation is invoked as a uniqueness theorem, no ansatz is smuggled, and the derivation chain consists of standard self-distillation steps plus an adaptive weighting heuristic whose effect is validated externally rather than presupposed.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs possess inherent safety capabilities in high-resource languages that can be leveraged for transfer.
invented entities (2)
- Multilingual Self-Distillation (MSD) framework · no independent evidence
- Dual-Perspective Safety Weighting (DPSW) · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "MSD framework... Dual-Perspective Safety Weighting (DPSW)... token-level divergence measure... w^T_t = 1 - (top-K entropy)... w^S_t = 1 - p^S(y*_t)... L(θ) = E[w̃_t · D(p^S ‖ p^T)]"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "cross-lingual safeguard transfer... no response data... on-policy/off-policy self-distillation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.