Pith Number

pith:SUDNVRJX

pith:2026:SUDNVRJXYPEWMY7LUGANHCHAKU
not attested · not anchored · not stored · refs pending

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Mark Dras, Usman Naseem, Utsav Maskey

Aligned LLMs encode harmful refusal in one global hidden-state direction but spread over-refusal across separate task-specific subspaces.

arxiv:2603.27518 v3 · 2026-03-29 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{SUDNVRJXYPEWMY7LUGANHCHAKU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Harmful-refusal directions are task-agnostic and can be captured by a single global vector, whereas over-refusal directions are task-dependent: they reside within the benign task-representation clusters, vary across tasks, and span a higher-dimensional subspace.

C2 · weakest assumption

That linear probes and direction-finding methods on hidden states accurately isolate the causal mechanisms of refusal rather than merely capturing correlational patterns.

C3 · one-line summary

Harmful refusal in aligned LLMs is captured by a single task-agnostic vector, but over-refusal directions are task-dependent, reside in benign task clusters, and span higher-dimensional subspaces.

Receipt and verification
First computed: 2026-05-29T01:05:08.264221Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db

Aliases

arxiv: 2603.27518 · arxiv_version: 2603.27518v3 · doi: 10.48550/arxiv.2603.27518 · pith_short_12: SUDNVRJXYPEW · pith_short_16: SUDNVRJXYPEWMY7L · pith_short_8: SUDNVRJX
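
The short aliases are prefixes of the full 26-character Pith Number. In this record the full number also coincides with the unpadded Base32 encoding (RFC 4648 alphabet) of the first 16 bytes of the canonical hash. That rule is inferred from the values shown on this page, not taken from a published specification, so the Python sketch below is a consistency check under that assumption. The function name `derive_pith_number` is hypothetical.

```python
import base64

# Canonical hash as shown on this page.
CANONICAL_HASH = "9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db"


def derive_pith_number(canonical_hash_hex: str) -> str:
    """Assumed rule: Base32 of the first 16 hash bytes, '=' padding stripped."""
    digest_prefix = bytes.fromhex(canonical_hash_hex)[:16]
    return base64.b32encode(digest_prefix).decode("ascii").rstrip("=")


pith_number = derive_pith_number(CANONICAL_HASH)

# The short aliases are plain prefixes of the full number.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print(pith_number)  # SUDNVRJXYPEWMY7LUGANHCHAKU
print(aliases)
```

If the assumed rule holds for other records too, a client could recompute the identifier from the hash alone instead of trusting the displayed aliases. This page does not confirm that.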
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/SUDNVRJXYPEWMY7LUGANHCHAKU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "36a047054b045a16488b2793b3efb48cac8af020700fbefe2d1ca71f664e5fb3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-03-29T04:53:40Z",
    "title_canon_sha256": "b7dd9d502d6f0c0f6a781fdf84e1f7fcd69d73c91989503844322f1f8955580e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2603.27518",
    "kind": "arxiv",
    "version": 3
  }
}
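
The same check as the curl pipeline above can be run offline against the record printed here. This is a minimal Python sketch that reuses the serialization from that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). The helper names `canonical_bytes` and `canonical_hash` are hypothetical, and whether this copy of the record reproduces the published hash is left for the reader to confirm by running it.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "36a047054b045a16488b2793b3efb48cac8af020700fbefe2d1ca71f664e5fb3",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-03-29T04:53:40Z",
    "title_canon_sha256": "b7dd9d502d6f0c0f6a781fdf84e1f7fcd69d73c91989503844322f1f8955580e"
  },
  "schema_version": "1.0",
  "source": {"id": "2603.27518", "kind": "arxiv", "version": 3}
}
"""

# Hash the page says to expect.
EXPECTED_HASH = "9506dac537c3c96663eba180d388e05530cf4304a998ff07bc025c7c5e6922db"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no whitespace, as in the shell one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_hash(record)
print(digest)
print("matches expected:", digest == EXPECTED_HASH)
```

Because the keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON text being parsed.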