Introspection adapters are LoRA adapters trained jointly across fine-tuned models with implanted behaviors, so that an LLM verbalizes the behaviors it has learned. The adapters generalize to detecting hidden behaviors on AuditBench and to encrypted attacks.
Introspection Adapters: Training LLMs to Report Their Learned Behaviors