Fundamental limitations of alignment in large language models, 2023

Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua · 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.CL · 2023-08-27 · unverdicted · novelty 5.0

Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 33