Pith Number

pith:HSKNHNGD

pith:2026:HSKNHNGDRTIAXV2QBJMCM5TZ6F
not attested · not anchored · not stored · refs resolved

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

Haz Sameen Shahgir, Hui Liu, Longxuan Yu, N. Benjamin Erichson, Yue Dong, Yu Fu, Zhipeng Wei

On-policy self-distillation with privileged safety contexts reduces the safety tax while preserving reasoning in LLMs.

arxiv:2605.15239 v1 · 2026-05-14 · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{HSKNHNGDRTIAXV2QBJMCM5TZ6F}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across two reasoning-model families and five model scales, OPSA achieves a stronger safety-reasoning tradeoff than off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning, with the largest gains on smaller models (+8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B).

C2 · weakest assumption

The privileged safety context must make the frozen teacher reliably safer than the student trajectory, and the teacher flip rate must identify contexts that activate latent safety reasoning rather than simply producing safe-looking surface demonstrations.

C3 · one-line summary

On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.

References

42 extracted · 42 resolved · 18 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:05:47.726170Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d

Aliases

arxiv: 2605.15239 · arxiv_version: 2605.15239v1 · doi: 10.48550/arxiv.2605.15239 · pith_short_12: HSKNHNGDRTIA · pith_short_16: HSKNHNGDRTIAXV2Q · pith_short_8: HSKNHNGD
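The short aliases above are prefixes of the full 26-character identifier, and on this record the identifier coincides with the start of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks both relationships for the values shown on this page. The derivation rule is inferred from these values only and is an assumption, not a documented part of the Pith schema; the function name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d"
PITH_NUMBER = "HSKNHNGDRTIAXV2QBJMCM5TZ6F"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed rule (inferred, not documented): base32-encode the digest
    bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Identifier derived from the hash under the assumed rule.
derived = pith_from_hash(CANONICAL_HASH)

# Short aliases taken as plain prefixes of the full identifier.
short_aliases = {n: PITH_NUMBER[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(short_aliases)
```

If the assumed rule holds for other records as well, a short alias can be checked against a canonical hash without any lookup, although shorter prefixes are naturally more likely to collide.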
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/HSKNHNGDRTIAXV2QBJMCM5TZ6F \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "771ca615e3f8482df0a1d10148fdccbb529b2a2224f1bd8bfc733029907ebb18",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T03:40:07Z",
    "title_canon_sha256": "06535b4219a6c259385195a93165002000bc862a61b2554439977fd12f7563f0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15239",
    "kind": "arxiv",
    "version": 1
  }
}
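
The verification one-liner serializes the record with sorted keys, compact separators, and unescaped Unicode before hashing it with SHA-256. A minimal Python sketch of that canonicalization, applied to the record displayed above, follows. The helper name `canonical_sha256` is hypothetical, and whether the resulting digest equals the published canonical hash depends on the served record being identical to the JSON shown here, which this sketch does not confirm.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "771ca615e3f8482df0a1d10148fdccbb529b2a2224f1bd8bfc733029907ebb18",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T03:40:07Z",
        "title_canon_sha256": "06535b4219a6c259385195a93165002000bc862a61b2554439977fd12f7563f0",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15239", "kind": "arxiv", "version": 1},
}

PUBLISHED_HASH = "3c94d3b4c38cd00bd7500a58267679f17ea23e2cd804f740998aefc07e98207d"


def canonical_sha256(record: dict) -> str:
    """Hash the record using the same json.dumps settings as the one-liner."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)
```

Because keys are sorted before serialization, the order in which a mirror stores the fields does not affect the digest.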