PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

Chang Wu; Guanjun Jiang; Houcheng Jiang; Junfeng Fang; Kai Tang; Pengyu Cheng; Xiang Wang; Xiaoxi Jiang

arxiv: 2606.25442 · v1 · pith:3QC25UTRnew · submitted 2026-06-24 · 💻 cs.CL

Pith reviewed 2026-06-25 21:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords safety alignment · large language models · policy alignment · self-distillation · LLM safety · over-refusal · capability preservation

The pith

LLMs can be directly aligned to safety policies stated in natural language without needing curated supervision data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Safety policies for large language models often arrive as plain text descriptions, but standard alignment requires expensive preference data or demonstrations. PolicyAlign generates violating instructions from the policy itself and uses on-policy self-distillation to train the model to follow it. A filtering step keeps only the instructions that cause the biggest change in behavior. Experiments on multiple models confirm higher safety scores, low over-refusal rates, and unchanged general performance. The approach extends to domain-specific policies in medicine, law, and finance.

Core claim

Given a safety policy, PolicyAlign synthesizes policy-violating instructions, selects those with largest behavioral shift via Policy-Sensitive Filtering, and performs on-policy self-distillation to internalize policy-guided behavior, resulting in improved safety without supervision data.

What carries the argument

On-policy self-distillation after synthesizing policy-violating instructions, combined with Policy-Sensitive Filtering to select high-impact examples.
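One way to picture the filtering step, as a minimal sketch only: the source does not say how the behavioral shift is measured, so this assumes a KL divergence between the model's response distribution with the policy in context and without it, followed by top-k selection. The names `kl_divergence` and `policy_sensitive_filter`, and the toy `refuse`/`comply` distributions, are hypothetical and are not taken from the paper or its released code.

```python
import math
from typing import Dict, List, Tuple

# Assumption: a response distribution is a map from an outcome label to its probability.
Dist = Dict[str, float]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-9) -> float:
    """KL(p || q) over the union of keys, with eps smoothing for missing mass."""
    keys = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in keys
        if p.get(k, 0.0) > 0.0
    )


def policy_sensitive_filter(
    candidates: List[Tuple[str, Dist, Dist]], keep: int
) -> List[str]:
    """Keep the `keep` instructions whose behavior shifts most under the policy.

    Each candidate is (instruction, dist_with_policy, dist_without_policy).
    """
    scored = [
        (kl_divergence(with_policy, without_policy), instruction)
        for instruction, with_policy, without_policy in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [instruction for _, instruction in scored[:keep]]


# Toy data (hypothetical): probability of refusing vs. complying.
candidates = [
    ("instruction A", {"refuse": 0.9, "comply": 0.1}, {"refuse": 0.1, "comply": 0.9}),
    ("instruction B", {"refuse": 0.5, "comply": 0.5}, {"refuse": 0.5, "comply": 0.5}),
    ("instruction C", {"refuse": 0.7, "comply": 0.3}, {"refuse": 0.4, "comply": 0.6}),
]
selected = policy_sensitive_filter(candidates, keep=2)
```

With these toy numbers, instruction B shows no shift at all and is dropped, while A and C are retained in that order. A real pipeline would need some estimate of the two distributions from model outputs, which this sketch leaves open.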

If this is right

Consistent safety improvements across different LLMs.
Low over-refusal rates and preserved general capabilities.
Generalization to medical, legal, and financial safety scenarios.
Scalable alignment when new policies emerge without available training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the synthesis step works well, this reduces the need for human experts to create alignment datasets for every new policy.
Rapid policy updates become feasible in production systems by simply providing updated policy text.
Potential to extend the method to other behavioral policies beyond safety, such as style or domain constraints.
Combining with existing preference tuning methods could further enhance results.

Load-bearing premise

The model-generated instructions accurately reflect real-world policy violations and the self-distillation process avoids introducing new biases or capability losses not caught by the evaluation sets.

What would settle it

The effectiveness claim would be falsified if running PolicyAlign on a policy produced either no increase in safety metrics on independent violation tests or a drop in performance on capability benchmarks.
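The falsification condition can be written down as a small check. This is a sketch under stated assumptions: the `EvalResult` fields, the function name, and both tolerance values are hypothetical, and the source gives no thresholds for what counts as "no increase" or "a drop".

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    safety: float      # score on an independent violation test set, higher is better
    capability: float  # score on a general capability benchmark, higher is better


def effectiveness_falsified(
    before: EvalResult,
    after: EvalResult,
    min_safety_gain: float = 0.0,       # assumed tolerance, not from the source
    max_capability_drop: float = 0.01,  # assumed tolerance, not from the source
) -> bool:
    """Return True if either falsifying outcome is observed."""
    no_safety_gain = (after.safety - before.safety) <= min_safety_gain
    capability_lost = (before.capability - after.capability) > max_capability_drop
    return no_safety_gain or capability_lost


# Hypothetical numbers for illustration only; these are not results from the paper.
base = EvalResult(safety=0.62, capability=0.70)
improved = EvalResult(safety=0.91, capability=0.698)
unchanged = EvalResult(safety=0.62, capability=0.70)
degraded = EvalResult(safety=0.91, capability=0.60)
```

Here `effectiveness_falsified(base, improved)` is `False`, while the `unchanged` and `degraded` cases both return `True`.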

Original abstract

Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often specified as natural-language policies, while corresponding supervision data may be costly, delayed, or unavailable. This creates a mismatch between rapidly evolving safety policies and conventional data-driven alignment methods. To address this, we propose PolicyAlign, a simple yet effective framework for directly aligning LLMs with safety policies. Given a safety policy, PolicyAlign first synthesizes policy-violating instructions and then performs on-policy self-distillation to internalize policy-guided behavior. To improve training stability and data efficiency, we further introduce Policy-Sensitive Filtering, which selects instructions where the policy induces the largest behavioral shift. Experiments across multiple models show that PolicyAlign consistently improves safety while maintaining low over-refusal and preserving general capabilities. PolicyAlign also generalizes to medical, legal, and financial safety scenarios, highlighting its potential as a scalable and maintainable approach to policy-based LLM safety alignment. The code is released at https://github.com/Qwen-Applications/PolicyAlign.
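To make the on-policy self-distillation step concrete, here is a minimal sketch of one plausible objective. It assumes the teacher is the same model with the policy placed in context, the student is the model given only the instruction, and the loss is a per-token reverse KL evaluated along a response sampled from the student. The abstract does not specify the objective, so the choice of reverse KL and all function names here are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    return sum(si * math.log(si / ti) for si, ti in zip(s, t) if si > 0.0)


def self_distillation_loss(
    student_steps: List[List[float]], teacher_steps: List[List[float]]
) -> float:
    """Average per-token reverse KL along one student-sampled response.

    student_steps[i]: logits from the model given only the instruction.
    teacher_steps[i]: logits from the same model given policy + instruction,
    scored on the same student-sampled prefix (the "on-policy" part).
    """
    if len(student_steps) != len(teacher_steps) or not student_steps:
        raise ValueError("student and teacher need the same non-zero number of steps")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy two-token vocabulary: the teacher disagrees at step 0 and agrees at step 1.
student = [[2.0, 0.0], [1.0, 1.0]]
teacher = [[0.0, 2.0], [1.0, 1.0]]
loss = self_distillation_loss(student, teacher)
```

The loss is zero when the policy-free model already matches its policy-conditioned behavior and positive otherwise, which is the sense in which training would "internalize" the policy under this assumed objective.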

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note: private letter to a colleague

PolicyAlign gives a workable pipeline for turning natural-language policies into alignment data via self-synthesis and filtering, but the abstract supplies almost no experimental detail to judge whether the safety gains are real.

The main point is that PolicyAlign takes a text safety policy, prompts the model to create violating instructions, then runs on-policy self-distillation while filtering for cases where the policy produces the largest output shift. This sidesteps the usual need for fresh preference data when policies change.

It handles a genuine deployment friction: policies from regulators or companies move faster than new labeled datasets can be built. The filtering step is a straightforward way to focus compute on high-impact examples. Releasing the code is useful for anyone who wants to inspect or extend the pipeline.

The evidence is the weak link. The abstract states consistent safety gains, low over-refusal, preserved capabilities, and generalization to medical, legal, and financial cases, yet it gives no model sizes, no metric definitions, no baseline comparisons, and no description of how the synthetic violations were validated against real edge cases. Without those, the central claim cannot be checked. The worry that model-generated violations may be too superficial or model-specific is reasonable and not resolved in the given text.

This is for engineers and researchers who need quick adaptation to new text policies rather than fundamental alignment advances. A practitioner facing regulatory updates could try the method and run their own tests. It deserves peer review because the approach is simple, the code is public, and referees can examine the actual runs to see if the data quality and evaluation choices hold up.

Referee Report

2 major / 1 minor

Summary. The paper proposes PolicyAlign, a framework for aligning LLMs directly with natural-language safety policies. Given a policy, it first prompts the base model to synthesize policy-violating instructions, then performs on-policy self-distillation on the resulting pairs; Policy-Sensitive Filtering is added to retain only instructions that induce the largest behavioral shift under the policy. Experiments across multiple models are reported to show consistent safety gains, low over-refusal rates, preserved general capabilities, and generalization to medical, legal, and financial domains. The code is released.

Significance. If the empirical claims hold, the method would address a practical mismatch between rapidly changing natural-language policies and the need for new supervision data, offering a more scalable route to policy-based alignment. The open release of code is a concrete strength that supports reproducibility and follow-up work.

major comments (2)

[Abstract / Method (synthesis step)] The headline claim (consistent safety gains without capability loss or over-refusal across models and domains) rests on the assumption that model-synthesized violating instructions faithfully sample the distribution of realistic policy violations that would arise in deployment. The abstract and method description provide no validation, coverage analysis, or comparison against human-written or red-team violations to support this; if the synthetic set is biased toward superficial or model-specific cases, the self-distillation step cannot be guaranteed to produce the reported generalization.
[Experiments] The reported improvements lack accompanying details on model sizes, exact evaluation metrics and thresholds, baseline implementations, statistical significance tests, or data exclusion rules. Without these, it is not possible to assess whether the claimed gains are robust or whether they could be artifacts of the chosen benchmarks.
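As an illustration of the kind of significance test the second comment asks for, the sketch below computes a paired bootstrap confidence interval on per-prompt score differences between a baseline and an aligned model. The choice of a paired bootstrap, the function name, and the example scores are assumptions; the source does not say which test, if any, the authors used.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    before: List[float],
    after: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired differences (after - before)."""
    if len(before) != len(after) or not before:
        raise ValueError("before and after must be non-empty and the same length")
    diffs = [a - b for a, b in zip(after, before)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-prompt safety scores, for illustration only.
baseline_scores = [0.1, 0.2, 0.3]
aligned_scores = [0.5, 0.6, 0.9]
ci_low, ci_high = paired_bootstrap_ci(baseline_scores, aligned_scores)
```

An interval that excludes zero would suggest the gain is unlikely to be resampling noise on that benchmark; it says nothing about whether the benchmark itself is representative.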

minor comments (1)

[Abstract] The abstract states that Policy-Sensitive Filtering 'selects instructions where the policy induces the largest behavioral shift,' but does not define the precise divergence measure or threshold used; both should be specified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment below, providing clarifications and indicating planned revisions where appropriate to strengthen the manuscript.

Referee: [Abstract / Method (synthesis step)] The headline claim (consistent safety gains without capability loss or over-refusal across models and domains) rests on the assumption that model-synthesized violating instructions faithfully sample the distribution of realistic policy violations that would arise in deployment. The abstract and method description provide no validation, coverage analysis, or comparison against human-written or red-team violations to support this; if the synthetic set is biased toward superficial or model-specific cases, the self-distillation step cannot be guaranteed to produce the reported generalization.

Authors: We acknowledge that the manuscript does not include direct validation, coverage analysis, or comparisons of synthesized instructions to human-written or red-team violations. The synthesis step prompts the base model with the policy, and Policy-Sensitive Filtering retains cases with the largest behavioral shift under the policy. While cross-model and cross-domain results provide supporting evidence for generalization, we agree the assumption merits explicit discussion. In revision we will add a subsection on synthesis assumptions, potential biases, and qualitative examples of generated instructions.

Revision: partial

Referee: [Experiments] The reported improvements lack accompanying details on model sizes, exact evaluation metrics and thresholds, baseline implementations, statistical significance tests, or data exclusion rules. Without these, it is not possible to assess whether the claimed gains are robust or whether they could be artifacts of the chosen benchmarks.

Authors: We agree that greater specificity on experimental details would improve clarity and reproducibility. The manuscript reports results across multiple models and domains with the released code, but we will expand the Experiments section in the revised manuscript to explicitly state model sizes, precise metric definitions and thresholds, baseline implementation details, statistical significance tests and results, and data exclusion rules.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is empirical with external benchmarks

The paper describes an empirical procedure: given a natural-language policy, synthesize violating instructions (via model prompting) and perform on-policy self-distillation, followed by Policy-Sensitive Filtering and evaluation on external benchmarks for safety, over-refusal, and capability. No equations, fitted parameters, or derivations are presented that reduce a claimed result to its inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing. The central claims rest on experimental outcomes measured against independent test sets rather than self-referential definitions or renamed inputs. This matches the default expectation of non-circularity for empirical alignment methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes that model-generated violations are representative and that self-distillation preserves capabilities, but these are not formalized.

pith-pipeline@v0.9.1-grok · 5746 in / 1131 out tokens · 25082 ms · 2026-06-25T21:07:40.853867+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 1 canonical work pages

