Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks
7 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (7)
verdicts: UNVERDICTED (7)
roles: background (1)
polarities: background (1)
representative citing papers
D²-Monitor routes between lightweight and heavy safety probes using the count of hesitation steps in diffusion LLM denoising trajectories, achieving a state-of-the-art trade-off on three datasets with fewer than 0.85M parameters.
A boundary-targeted membership inference attack (MIA) strategy recovers 19% of distress-flagged conversations from a safety classifier at a 5% false-positive rate, 3.5 times better than prior methods.
Probe trajectories across token positions in large reasoning models (LRMs), combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78%, and probe accuracy scales with model size across models from 0.5B to 176B parameters.
Hidden-state probes enable low-overhead streaming moderation of LLM outputs by producing per-token safety scores from internal activations.
Prompt injection detection performance is highly regime-dependent, with no single detector dominating across settings; transformer models perform best overall, while structural signals offer modest gains in some regimes.