pith:SUDNVRJX
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
Aligned LLMs encode harmful refusal in one global hidden-state direction but spread over-refusal across separate task-specific subspaces.
arxiv:2603.27518 v3 · 2026-03-29 · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{SUDNVRJXYPEWMY7LUGANHCHAKU}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace.
The analysis rests on the premise that linear probes and direction-finding methods on hidden states accurately isolate the causal mechanisms of refusal rather than merely capturing correlational patterns.
Harmful refusal in aligned LLMs is captured by a single task-agnostic vector, but over-refusal directions are task-dependent, reside in benign task clusters, and span higher-dimensional subspaces.
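To make the contrast between "a single global vector" and "a higher-dimensional subspace" concrete, here is a toy sketch on synthetic data. It is not the paper's method or data: the difference-of-means direction is a commonly used stand-in, and every name and number below (`mean_difference_direction`, `effective_rank`, the 16-dimensional space, the 4 tasks, the 0.95 energy threshold) is an assumption chosen for illustration. The synthetic "harmful" shift is built to be the same for every task, and the synthetic "over-refusal" shift is built to differ per task, so the result only shows how such a difference would appear in a rank measurement.

```python
import numpy as np

def mean_difference_direction(refused, complied):
    """Unit vector pointing from the complied-mean to the refused-mean state."""
    diff = refused.mean(axis=0) - complied.mean(axis=0)
    return diff / np.linalg.norm(diff)

def effective_rank(directions, energy=0.95):
    """Smallest number of singular directions holding `energy` of the variance."""
    s = np.linalg.svd(np.stack(directions), compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)

# Synthetic setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
dim, n_tasks, n_samples = 16, 4, 200
basis = np.eye(dim)
global_dir = basis[0]               # one shared shift: toy "harmful refusal"
task_dirs = basis[1:1 + n_tasks]    # one shift per task: toy "over-refusal"

def sample(shift):
    """Draw noisy hidden states centered on `shift`."""
    return shift + 0.1 * rng.standard_normal((n_samples, dim))

harmful, over = [], []
for t in range(n_tasks):
    complied = sample(np.zeros(dim))
    harmful.append(mean_difference_direction(sample(3.0 * global_dir), complied))
    over.append(mean_difference_direction(sample(3.0 * task_dirs[t]), complied))

harmful_rank = effective_rank(harmful)  # per-task directions nearly coincide
over_rank = effective_rank(over)        # per-task directions are spread out
print(harmful_rank, over_rank)
```

Because the shifts are planted by construction, this says nothing about real models. It also does not address the premise listed above: a direction recovered this way is correlational until it is tested by intervention.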
Receipt and verification
| First computed | 2026-05-29T01:05:08.264221Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/SUDNVRJXYPEWMY7LUGANHCHAKU \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "36a047054b045a16488b2793b3efb48cac8af020700fbefe2d1ca71f664e5fb3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-03-29T04:53:40Z",
    "title_canon_sha256": "b7dd9d502d6f0c0f6a781fdf84e1f7fcd69d73c91989503844322f1f8955580e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2603.27518",
    "kind": "arxiv",
    "version": 3
  }
}
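The verification command above serializes the canonical record with sorted keys and compact separators, then takes the SHA-256 of the UTF-8 bytes. The same steps can be sketched as a small offline Python script that works on the record as printed on this page instead of fetching it. The function names `canonical_hash` and `matches` are illustrative, not part of any Pith tooling, and the sketch assumes the record served by the API is identical to the JSON shown here.

```python
import hashlib
import json

def canonical_hash(record: dict) -> str:
    """SHA-256 of the record serialized as in the one-liner:
    sorted keys, compact separators, non-ASCII kept, UTF-8 encoded."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()

def matches(record: dict, expected_hex: str) -> bool:
    """Compare the computed digest with a published one, ignoring hex case."""
    return canonical_hash(record) == expected_hex.lower()

# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "36a047054b045a16488b2793b3efb48cac8af020700fbefe2d1ca71f664e5fb3",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-03-29T04:53:40Z",
        "title_canon_sha256": "b7dd9d502d6f0c0f6a781fdf84e1f7fcd69d73c91989503844322f1f8955580e",
    },
    "schema_version": "1.0",
    "source": {"id": "2603.27518", "kind": "arxiv", "version": 3},
}

# Hash published on this page under "Canonical hash".
PUBLISHED_HASH = "9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db"

print(canonical_hash(record))
print(matches(record, PUBLISHED_HASH))
```

Sorting keys and fixing the separators makes the digest independent of how the JSON is indented or ordered on the page, which is why a pretty-printed copy can still be checked against the published hash.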