Title resolution pending

URL https: //arxiv · 2024 · arXiv 2406.05644

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Before the Last Token: Diagnosing Final-Token Safety Probe Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.

Why Do Large Language Models Generate Harmful Content?

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

citing papers explorer

Showing 3 of 3 citing papers.

Before the Last Token: Diagnosing Final-Token Safety Probe Failures cs.LG · 2026-05-12 · unverdicted · none · ref 15
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Why Do Large Language Models Generate Harmful Content? cs.AI · 2026-04-13 · unverdicted · none · ref 29
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models cs.LG · 2026-04-08 · unverdicted · none · ref 22
Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer