For the feature banks used in these experiments, the stored percentile anchors are: (1) δ_{c,p95} = 0.4161, (2) δ_{c,p50} = 0.0769, (3) δ_{cos,p95} = 0.1651, (4) variance_{p95} = 1.3636.

otherwise usesstructure · 2023 · arXiv 4161.5724

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

cs.CR · 2026-06-02 · unverdicted · novelty 4.0 · ref 25

NeuroArmor uses safe-variant-guided representation consistency checks for selective intervention, reducing jailbreak ASR from 41.56% to 1.57% and benign FPR from 30.26% to 22.05% on Llama-3-8B-Instruct.
