Prompt-driven LLM safeguarding via directed representation optimization

· 2024 · arXiv 2401.18018

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 2 method 1

citation-polarity summary

background 3

representative citing papers

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

cs.LG · 2026-02-05 · unverdicted · novelty 6.0

f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.

ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

cs.CR · 2025-06-02 · unverdicted · novelty 5.0

ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

citing papers explorer

Showing 9 of 9 citing papers.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing cs.SE · 2026-02-02 · unverdicted · none · ref 67
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 206
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving cs.CV · 2026-05-17 · unverdicted · none · ref 48
CLAP reduces planning error on challenging driving scenarios by 24% on NAVSIM using contrastive latent-space prompt optimization on frozen VLA models with no regression on normal frames.
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 17
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 131
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment cs.LG · 2026-02-05 · unverdicted · none · ref 17
f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction cs.AI · 2026-05-18 · unverdicted · none · ref 59
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction cs.CR · 2025-06-02 · unverdicted · none · ref 35
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey cs.CR · 2024-07-05 · accept · none · ref 118
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Prompt-driven LLM safeguarding via directed representation optimization

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer