NeuroArmor uses safe-variant-guided representation consistency checks for selective intervention, reducing jailbreak ASR from 41.56% to 1.57% and benign FPR from 30.26% to 22.05% on Llama-3-8B-Instruct.
For the feature banks used in these experiments, the stored percentile anchors are:
1. $\delta_{c,\mathrm{p95}} = 0.4161$
2. $\delta_{c,\mathrm{p50}} = 0.0769$
3. $\delta_{\cos,\mathrm{p95}} = 0.1651$
4. $\mathrm{variance}_{\mathrm{p95}} = 1.3636$
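To illustrate how percentile anchors of this kind could be derived and consulted, here is a minimal sketch. It assumes each feature bank is simply a list of per-example scores. The names `percentile`, `percentile_anchors`, `exceeds_anchor`, and the dictionary keys are hypothetical, and the source does not define how the underlying quantities are measured or how the anchors enter the intervention decision. Only the four stored values are taken from the text.

```python
import statistics

def percentile(values, p):
    """Return the p-th percentile (integer p in 1..99) with linear interpolation."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[p - 1]

def percentile_anchors(delta_c, delta_cos, variance):
    """Summarise three hypothetical score banks by the percentiles named in the text."""
    return {
        "delta_c_p95": percentile(delta_c, 95),
        "delta_c_p50": percentile(delta_c, 50),
        "delta_cos_p95": percentile(delta_cos, 95),
        "variance_p95": percentile(variance, 95),
    }

# Anchor values as reported in the text for the experimental feature banks.
STORED_ANCHORS = {
    "delta_c_p95": 0.4161,
    "delta_c_p50": 0.0769,
    "delta_cos_p95": 0.1651,
    "variance_p95": 1.3636,
}

def exceeds_anchor(score, anchors, key):
    """Assumed usage: flag a score that lies above the chosen anchor."""
    return score > anchors[key]
```

For example, `exceeds_anchor(0.5, STORED_ANCHORS, "delta_c_p95")` flags a score above the stored 95th-percentile anchor. Whether the actual system compares against a single anchor or combines several is not stated in the source.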
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense