Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
Constitutional classifiers++: Efficient production-grade defenses against universal jailbreaks
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.
citing papers explorer
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
Boundary-targeted Membership Inference Attacks on Safety Classifiers
A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.
-
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
-
Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78% and probe accuracy scales with model size across 0.5B to 176B parameter models.