Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
citing papers explorer
-
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
-
Why Do Large Language Models Generate Harmful Content?
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
-
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.