I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022

Simon Willison · 2022 · arXiv web/2022092

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

The Enforcement and Feasibility of Hate Speech Moderation on Twitter

cs.CY · 2026-04-14 · conditional · novelty 6.0

80% of hateful tweets remain online after five months with no higher removal rate than non-hateful content, while human-AI moderation pipelines can feasibly cut user exposure below regulatory penalty costs.

Ignore Previous Prompt: Attack Techniques For Language Models

cs.CL · 2022-11-17 · unverdicted · novelty 6.0

PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

citing papers explorer

Showing 2 of 2 citing papers.

The Enforcement and Feasibility of Hate Speech Moderation on Twitter cs.CY · 2026-04-14 · conditional · none · ref 31
80% of hateful tweets remain online after five months with no higher removal rate than non-hateful content, while human-AI moderation pipelines can feasibly cut user exposure below regulatory penalty costs.
Ignore Previous Prompt: Attack Techniques For Language Models cs.CL · 2022-11-17 · unverdicted · none · ref 29
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

I missed this one: Someone did get a prompt leak attack to work against the bot, Sep 2022

fields

years

verdicts

representative citing papers

citing papers explorer