pith:HSKNHNGD
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
On-policy self-distillation with privileged safety contexts reduces the safety tax while preserving reasoning in LLMs.
arxiv:2605.15239 v1 · 2026-05-14 · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{HSKNHNGDRTIAXV2QBJMCM5TZ6F}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B).
The privileged safety context must make the frozen teacher reliably safer than the student trajectory, and the teacher flip rate must identify contexts that activate latent safety reasoning rather than simply producing safe-looking surface demonstrations.
On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
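The claims above mention a "teacher flip rate" without defining it on this page. The sketch below shows one plausible reading, stated as an assumption: among prompts where the model's response is judged unsafe without the privileged safety context, the fraction that become safe once the frozen teacher is given that context. The function name `teacher_flip_rate` and the boolean safety judgments are hypothetical and not taken from the paper.

```python
from typing import Sequence


def teacher_flip_rate(safe_without: Sequence[bool], safe_with: Sequence[bool]) -> float:
    """Fraction of initially unsafe responses that become safe with the context.

    ASSUMPTION: this definition is an illustrative guess, not the paper's.
    safe_without[i] is the safety judgment for prompt i without the privileged
    context; safe_with[i] is the judgment for the teacher given the context.
    """
    if len(safe_without) != len(safe_with):
        raise ValueError("both judgment lists must cover the same prompts")
    # Only prompts that were unsafe to begin with can "flip" to safe.
    unsafe_idx = [i for i, is_safe in enumerate(safe_without) if not is_safe]
    if not unsafe_idx:
        return 0.0
    flipped = sum(1 for i in unsafe_idx if safe_with[i])
    return flipped / len(unsafe_idx)


# Toy judgments (made up for illustration): three prompts start unsafe,
# and two of them become safe with the context.
example_rate = teacher_flip_rate(
    [False, False, True, False],
    [True, False, True, True],
)
```

Under this reading, a higher rate would indicate a context that more reliably makes the teacher safer than the student trajectory, which is the condition the second claim describes.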
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-20T00:05:47.726170Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/HSKNHNGDRTIAXV2QBJMCM5TZ6F \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d
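The same check can be written as a small Python script, which may be easier to read than the one-liner. This is a minimal sketch that mirrors the serialization used above (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding); the helper names `canonical_sha256` and `matches` are illustrative, and the record literal is copied from the canonical record JSON shown on this page rather than fetched from the API.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record with the same canonical serialization as the one-liner."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def matches(record: dict, expected_hex: str) -> bool:
    """Compare the computed digest with a published hex digest."""
    return canonical_sha256(record) == expected_hex.strip().lower()


# Canonical record as listed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "771ca615e3f8482df0a1d10148fdccbb529b2a2224f1bd8bfc733029907ebb18",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T03:40:07Z",
        "title_canon_sha256": "06535b4219a6c259385195a93165002000bc862a61b2554439977fd12f7563f0",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15239", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which fields appear in the JSON, only on their names and values.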
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "771ca615e3f8482df0a1d10148fdccbb529b2a2224f1bd8bfc733029907ebb18",
"cross_cats_sorted": [],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.LG",
"submitted_at": "2026-05-14T03:40:07Z",
"title_canon_sha256": "06535b4219a6c259385195a93165002000bc862a61b2554439977fd12f7563f0"
},
"schema_version": "1.0",
"source": {
"id": "2605.15239",
"kind": "arxiv",
"version": 1
}
}