pith:W3AYUNLZ
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Alignment in language models routes safety policies through early attention gates rather than erasing unsafe capabilities.
arxiv:2604.04385 v5 · 2026-04-06 · cs.CL · cs.AI · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{W3AYUNLZEN3MUG2TPF2KH65SME}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
The safety-trained capability is gated by routing, not removed; modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering, and any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.
This rests on the assumption that interchange interventions and knockout cascades isolate the causal contribution of the identified gate and amplifier heads without substantial side effects on other circuits or on the model's general capability.
Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying capability.
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-06-30T02:17:20.138356Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
b6c18a35792376ca1b537974a3fbb2611876e15fd7f7ed7276e3dde75a33cdf6
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/W3AYUNLZEN3MUG2TPF2KH65SME \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b6c18a35792376ca1b537974a3fbb2611876e15fd7f7ed7276e3dde75a33cdf6
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "f522f2b99c1fb0ffb7e53ab70c8fefe8c3473976e4c4ffa1703bcdf5cbd639ec",
"cross_cats_sorted": [
"cs.AI",
"cs.LG"
],
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"primary_cat": "cs.CL",
"submitted_at": "2026-04-06T03:20:37Z",
"title_canon_sha256": "2535a2bd1b60beef235b319f09b2c08e0dc47b227dca504ee3bdfb81f47b9caf"
},
"schema_version": "1.0",
"source": {
"id": "2604.04385",
"kind": "arxiv",
"version": 5
}
}
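The verification one-liner above can also be written as a small Python module, which makes the canonicalization step easier to inspect. This is a minimal sketch, assuming the canonical hash is the SHA-256 of the record serialized exactly as in that command (sorted keys, compact separators, non-ASCII left unescaped, UTF-8 bytes). The function names `canonical_bytes`, `canonical_hash`, and `verify` are hypothetical helpers introduced here for illustration and are not part of any Pith tooling.

```python
import hashlib
import json

# Expected digest, copied from the "Canonical hash" field of this record.
EXPECTED_HASH = "b6c18a35792376ca1b537974a3fbb2611876e15fd7f7ed7276e3dde75a33cdf6"


def canonical_bytes(record: dict) -> bytes:
    """Serialize a record the same way the one-liner does.

    Keys are sorted at every nesting level, separators carry no whitespace,
    and non-ASCII characters are kept as-is before UTF-8 encoding.
    """
    text = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify(record_json: str, expected: str = EXPECTED_HASH) -> bool:
    """Parse a JSON string and compare its canonical hash to `expected`."""
    return canonical_hash(json.loads(record_json)) == expected


# Key order in the input does not affect the canonical form.
first = {"b": 1, "a": {"d": 2, "c": 3}}
second = {"a": {"c": 3, "d": 2}, "b": 1}
assert canonical_bytes(first) == canonical_bytes(second)
```

To check this record, pass the text of the canonical record JSON (for example, the `.canonical_record` value returned by the `curl` command) to `verify`. Sorting keys and fixing the separators matters because two JSON documents with identical content but different whitespace or key order would otherwise hash differently.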