Prompt-driven LLM safeguarding via directed representation optimization
9 Pith papers cite this work. Polarity classification is still indexing.
Citation summary: background (3)
citing papers explorer
- RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
  RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving
  CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.
- A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
  Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
- f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
  f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
- Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
  Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
- ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
  ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.