Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
arXiv preprint arXiv:2408.17003 , year=
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
LEG is a compact model that jointly classifies unsafe prompts and explains its decisions using synthetic training data and a custom uncertainty-weighted loss.
TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
-
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
-
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
-
Why Do Large Language Models Generate Harmful Content?
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
-
A Lightweight Explainable Guardrail for Prompt Safety
LEG is a compact model that jointly classifies unsafe prompts and explains its decisions using synthetic training data and a custom uncertainty-weighted loss.
-
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations
TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.