arXiv preprint arXiv:2406.14144 , year=

· 2024 · arXiv 2406.14144

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

Why Do Large Language Models Generate Harmful Content?

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.

Secure LLM Fine-Tuning via Safety-Aware Probing

cs.LG · 2025-05-22 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

citing papers explorer

Showing 5 of 5 citing papers.

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling cs.CR · 2026-05-18 · unverdicted · none · ref 2
Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 12
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
Why Do Large Language Models Generate Harmful Content? cs.AI · 2026-04-13 · unverdicted · none · ref 4
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal cs.CL · 2025-09-07 · unverdicted · none · ref 11
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
Secure LLM Fine-Tuning via Safety-Aware Probing cs.LG · 2025-05-22 · unverdicted · none · ref 27
SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

arXiv preprint arXiv:2406.14144 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer