Safety layers of aligned large language models: The key to llm security

Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li , year= · 2024 · arXiv 2408.17003

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

cs.CR · 2026-05-09 · unverdicted · novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

Why Do Large Language Models Generate Harmful Content?

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

A Lightweight Explainable Guardrail for Prompt Safety

cs.CL · 2026-01-24 · conditional · novelty 6.0

LEG is a compact model that jointly classifies unsafe prompts and explains its decisions using synthetic training data and a custom uncertainty-weighted loss.

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

cs.CR · 2026-06-26 · unverdicted · novelty 5.0

Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.

SALLIE: Safeguarding Against Latent Language & Image Exploits

cs.CR · 2026-04-06 · unverdicted · novelty 5.0

SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

cs.CL · 2025-11-09

citing papers explorer

Showing 5 of 5 citing papers after filters.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 26
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs cs.CR · 2026-05-09 · unverdicted · none · ref 30
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models cs.CR · 2026-06-26 · unverdicted · none · ref 38
Jailbreak attacks suppress Adversarially Compromised Heads in early layers but leave Safety-Aligned Heads active in mid-layers, producing robust harmful features usable for competitive training-free detection.
SALLIE: Safeguarding Against Latent Language & Image Exploits cs.CR · 2026-04-06 · unverdicted · none · ref 12
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 86
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

Safety layers of aligned large language models: The key to llm security

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer