Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2
representative citing papers
MVIGER integrates complementary knowledge from diverse prompts and indices in generative recommenders via a variational model with learnable prior over latent sources, showing superior performance on three datasets.
citing papers explorer
- Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
  Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
- MVIGER: Multi-View Variational Integration of Complementary Knowledge for Generative Recommender
  MVIGER integrates complementary knowledge from diverse prompts and indices in generative recommenders via a variational model with learnable prior over latent sources, showing superior performance on three datasets.