pith:3QNU6ZTU
Mechanisms of Introspective Awareness
Large language models detect injected steering vectors through a two-stage circuit that emerges after preference optimization.
arxiv:2603.21396 v4 · 2026-03-22 · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{3QNU6ZTUTOAN5YGL32KFYKSXF6}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
We trace the detection mechanism to a two-stage circuit in which 'evidence carrier' features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream 'gate' features that implement a default negative response.
This account assumes that the activation changes observed after steering-vector injection are causally responsible for the behavioral detection rather than merely correlated with it. That assumption rests on the validity of the ablation and patching experiments used to identify the evidence-carrier and gate features.
DPO training induces a two-stage detection circuit in LLMs, built from early evidence-carrier features and downstream gate features. The circuit is absent in base models and distinct from later-layer identification mechanisms.
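The two-stage structure described in these claims can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's circuit: the function names, the `tanh` response, and every constant (`gain`, `default_bias`, `suppression`) are hypothetical and chosen purely to show an evidence signal that grows monotonically with the perturbation and suppresses a default "no" gate.

```python
import math


def evidence(perturbation_norm: float, gain: float = 2.0) -> float:
    """Toy 'evidence carrier': rises monotonically with perturbation size.

    The tanh shape and the gain are illustrative assumptions.
    """
    return math.tanh(gain * perturbation_norm)


def gate(evidence_value: float, default_bias: float = 1.0,
         suppression: float = 3.0) -> float:
    """Toy 'gate': active by default (the negative response), suppressed by evidence."""
    return max(0.0, default_bias - suppression * evidence_value)


def reports_injection(perturbation_norm: float) -> bool:
    """The toy model answers 'yes' only once the gate is fully suppressed."""
    return gate(evidence(perturbation_norm)) == 0.0


if __name__ == "__main__":
    # Sweep a few hypothetical injection strengths.
    for norm in (0.0, 0.1, 0.5, 1.0):
        print(norm, round(evidence(norm), 3), reports_injection(norm))
```

With no perturbation the gate stays at its default value and the toy model answers "no"; a large enough perturbation drives the gate to zero and flips the answer. Real features are identified empirically through ablation and patching, which this sketch does not attempt to reproduce.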
Receipt and verification
| First computed | 2026-05-20T00:01:40.442286Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
dc1b4f66749b80dee0cbde945c2a572f98634240808e8e8f21d7ac153019fddf
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/3QNU6ZTUTOAN5YGL32KFYKSXF6 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: dc1b4f66749b80dee0cbde945c2a572f98634240808e8e8f21d7ac153019fddf
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "0faeb7e14e32bcd80119d5aaf9260561081c70e8281f62eef5936815f21d0569",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-03-22T20:45:34Z",
    "title_canon_sha256": "3fda1e1ac9d141c93af47e8f354aad7d3f19a82fc23eb61f50a7bcde85c043ce"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2603.21396",
    "kind": "arxiv",
    "version": 4
  }
}
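The hashing step of the verification command can also be written as a standalone Python function, which makes the canonicalization rules easier to read: sorted keys, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256. This is a minimal offline sketch that assumes the record printed on this page is exactly what the endpoint returns under `canonical_record`. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with the same json.dumps options as the one-liner, then hash."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# The canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "0faeb7e14e32bcd80119d5aaf9260561081c70e8281f62eef5936815f21d0569",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-03-22T20:45:34Z",
        "title_canon_sha256": "3fda1e1ac9d141c93af47e8f354aad7d3f19a82fc23eb61f50a7bcde85c043ce",
    },
    "schema_version": "1.0",
    "source": {"id": "2603.21396", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on the page
```

Because the keys are sorted before hashing, the order in which fields appear in the fetched JSON does not change the digest. Any change to a value does.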