pith. machine review for the scientific record.

arxiv: 2308.03825 · v2 · submitted 2023-08-07 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links · Lean Theorem

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 08:29 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords jailbreak, prompts, llms, attack, prompt, communities, days, december

The pith

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This study examines prompts that users create to make large language models ignore their built-in rules against generating harmful content. Researchers built a framework called JailbreakHub to gather 1,405 such prompts from December 2022 to December 2023, finding 131 different online communities where these prompts are shared. They noticed that the prompts often rely on tricks like pretending the model is in a different role or escalating access privileges. The team also built a large test set of 107,250 questions covering 13 types of forbidden topics. When they tested these prompts on six popular LLMs, the safety mechanisms failed to block harmful outputs in many cases. Five specific prompts stood out as especially effective, succeeding 95 percent of the time on both GPT-3.5 and GPT-4. One of these prompts had been circulating online for more than 240 days. The work shows that jailbreak techniques are evolving and moving to new sharing platforms.
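The evaluation described above can be pictured as a small harness that crosses each collected jailbreak prompt with each forbidden question and records whether the model refuses. A minimal sketch, assuming a hypothetical `query_model` callable standing in for any chat-completion API; the function names and refusal keywords here are illustrative, not from the paper:

```python
# Illustrative sketch of a jailbreak evaluation harness (names hypothetical).
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't", "as an ai"]

def is_refusal(response: str) -> bool:
    """Crude keyword check: does the response look like a refusal?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def evaluate(prompts, questions, query_model):
    """Cross every jailbreak prompt with every forbidden question;
    count non-refusals as successful attacks and report a per-prompt rate."""
    results = {}
    for p in prompts:
        successes = 0
        for q in questions:
            # The jailbreak prompt is prepended to the forbidden question.
            answer = query_model(p + "\n\n" + q)
            if not is_refusal(answer):
                successes += 1
        results[p] = successes / len(questions)  # per-prompt attack success rate
    return results
```

In practice keyword matching alone misclassifies partial refusals, which is presumably why the authors pair it with manual review of edge cases.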

Core claim

our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4

Load-bearing premise

The 1,405 collected prompts and the 107,250-question set across 13 scenarios are representative enough to support broad conclusions about the inadequacy of safeguards on all LLMs.

read the original abstract

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the JailbreakHub framework to analyze 1,405 jailbreak prompts collected from December 2022 to December 2023 across 131 communities. It identifies characteristics, attack strategies such as prompt injection and privilege escalation, and trends like shifts to prompt-aggregation sites and persistent optimization by 28 accounts. The authors construct a dataset of 107,250 questions spanning 13 forbidden scenarios and evaluate six popular LLMs, reporting that five jailbreak prompts achieve 0.95 attack success rates on GPT-3.5 and GPT-4, concluding that current safeguards cannot adequately defend against jailbreaks in all scenarios.

Significance. If the collected prompts and scenarios prove representative, this work offers a valuable large-scale empirical characterization of real-world jailbreak techniques and a reusable question dataset that can drive improvements in LLM safety. The longitudinal tracking of prompt persistence (e.g., one effective prompt lasting over 240 days) and the scale of testing across multiple models provide concrete evidence of safeguard limitations that is useful for both researchers and vendors.

major comments (2)
  1. The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.
  2. In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.
minor comments (2)
  1. The abstract introduces JailbreakHub without a one-sentence overview of its main components or data pipeline, which would help readers quickly grasp the contribution.
  2. Ensure all figures and tables include clear captions explaining the 13 scenarios and how attack success is measured.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our manuscript. We address each major comment point-by-point below and indicate the revisions we will make in the next version of the paper.

read point-by-point responses
  1. Referee: The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.

    Authors: We appreciate the referee's point on the need for explicit methodological transparency to support claims of representativeness. The full manuscript describes identifying 131 communities through systematic searches across platforms including Reddit, Discord, and specialized forums, followed by collection of prompts that explicitly attempt to bypass LLM safeguards. The 13 forbidden scenarios were derived from prohibited categories in the usage policies of major LLM providers (e.g., OpenAI, Anthropic) and aligned with prior safety evaluation benchmarks. However, we acknowledge that a more detailed account of sampling methodology, precise exclusion criteria (such as requiring clear jailbreak intent and excluding duplicates or non-functional prompts), and validation steps for the scenarios would help readers assess potential selection bias. We will add a dedicated subsection to the Methods section outlining these processes, including any acknowledged limitations on generalizability. This revision will directly address the concern. revision: yes

  2. Referee: In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.

    Authors: We agree that precise experimental details are necessary to substantiate the quantitative results and the broader claim. In the manuscript, attack success rate (ASR) is defined as the proportion of model responses that successfully produce the requested harmful or forbidden content without issuing a refusal. To mitigate stochasticity, we ran three independent trials for each prompt-scenario combination using consistent prompt formatting and reported the averaged ASR. Response evaluation combined automated detection of common refusal keywords and phrases with manual review of edge cases. We will expand the Experimental Evaluation section to explicitly document the ASR definition, trial counts, parsing methodology, and controls for variability (such as noting default temperature settings and refusal patterns). These additions will provide the requested rigor without altering the reported findings. revision: yes
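The ASR protocol the rebuttal describes (three independent trials per prompt-question pair, keyword-based refusal detection, averaged to smooth stochasticity) can be sketched as follows. This is a reading of the rebuttal, not the paper's released code, and the refusal keywords are illustrative:

```python
# Sketch of the averaged ASR from the rebuttal (illustrative, not the paper's code).
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't assist")

def refused(response: str) -> bool:
    """Keyword-based refusal detection, as described in the rebuttal."""
    return any(k in response.lower() for k in REFUSAL_KEYWORDS)

def attack_success_rate(query_model, prompt, question, trials=3):
    """ASR = fraction of trials in which the model answers without refusing,
    averaged over independent trials to control for sampling stochasticity."""
    hits = sum(not refused(query_model(prompt, question)) for _ in range(trials))
    return hits / trials
```

Under this definition a single prompt-question pair yields an ASR in {0, 1/3, 2/3, 1}, so the headline 0.95 figure is an average across many such pairs.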

Circularity Check

0 steps flagged

No circularity: purely empirical collection and measurement study

full rationale

The paper performs data collection of 1,405 jailbreak prompts from 131 external online communities over a one-year period, constructs an independent question set of 107,250 samples across 13 scenarios, and reports direct attack-success measurements on six public LLMs. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the derivation chain; all reported results (including the 0.95 ASR on five prompts) are obtained by applying the collected prompts to external models and counting observed successes. The evaluation is therefore grounded entirely in external measurements, with no step in which outputs feed back into the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study is observational and relies on the assumption that sampled online prompts and chosen forbidden scenarios adequately represent the space of possible attacks; no new theoretical entities or fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption Jailbreak prompts can be reliably identified and categorized from public online sources.
Implicit in the collection of 1,405 prompts and identification of 131 communities.
  • domain assumption Attack success rate measured on a fixed question set reflects real-world harm potential.
    Used to support the claim that safeguards are inadequate.

pith-pipeline@v0.9.0 · 5563 in / 1367 out tokens · 91832 ms · 2026-05-17T08:29:12.747903+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  2. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  3. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    cs.CL 2023-10 conditional novelty 7.0

    Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

  4. Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.

  5. On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

    cs.CR 2026-05 conditional novelty 6.0

    An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

  6. Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

  7. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  8. Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

    cs.CR 2026-04 unverdicted novelty 6.0

    Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.

  9. Exclusive Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.

  10. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  11. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  12. A StrongREJECT for Empty Jailbreaks

    cs.LG 2024-02 conditional novelty 6.0

    StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

  13. Low-Resource Languages Jailbreak GPT-4

    cs.CL 2023-10 conditional novelty 6.0

    Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

  14. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  15. FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

    cs.CR 2026-04 unverdicted novelty 5.0

    FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.

  16. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

  17. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  18. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 18 Pith papers · 8 internal anchors

  1. [1]

    https: //assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/ 1146542/a_pro-innovation_approach_to_AI_ regulation.pdf

    A pro-innovation approach to AI regulation. https: //assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/ 1146542/a_pro-innovation_approach_to_AI_ regulation.pdf. 1, 3

  2. [2]

    https://www.aiprm.com/

    AIPRM. https://www.aiprm.com/. 4

  3. [3]

    https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts

    Awesome ChatGPT Prompts. https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts. 4

  4. [4]

    https://chat.openai.com/chat

    ChatGPT. https://chat.openai.com/chat. 1, 3, 8, 17

  5. [5]

    https://disboard.org/

    Disboard. https://disboard.org/. 4

  6. [6]

    https://en.wikipedia.org/wiki/Discord

    Discord. https://en.wikipedia.org/wiki/Discord. 4

  7. [7]

    https://flowgpt.com/

    FlowGPT. https://flowgpt.com/. 4

  8. [8]

    https:// gdpr-info.eu/

    General Data Protection Regulation (GDPR). https:// gdpr-info.eu/. 3

  9. [9]

    https://www.jailbreakchat.com

    JailbreakChat. https://www.jailbreakchat.com. 1, 4

  10. [10]

    http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm

    Measures for the Management of Generative Artificial Intelli- gence Services. http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm. 1, 3

  11. [11]

    https:// artificialintelligenceact.eu/

    The Artificial Intelligence Act. https:// artificialintelligenceact.eu/. 1, 3, 13

  12. [12]

    Open-Source Large Language Models Out- perform Crowd Workers and Approach ChatGPT in Text- Annotation Tasks

    Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin De- hghani, Juan Diego Bermeo, Maria Korobeynikova, and Fab- rizio Gilardi. Open-Source Large Language Models Out- perform Crowd Workers and Approach ChatGPT in Text- Annotation Tasks. CoRR abs/2307.02179, 2023. 17

  13. [13]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Ke- fan Xiao, Yuanzhong Xu, Yuji...

  14. [14]

    Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures. In IEEE Symposium on Security and Privacy (S&P), pages 769–786. IEEE, 2022. 13

  15. [15]

    A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V . Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of Chat- GPT on Reasoning, Hallucination, and Interactivity. CoRR abs/2302.04023, 2023. 3

  16. [16]

    The Pushshift Reddit Dataset

    Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The Pushshift Reddit Dataset. In International Conference on Web and Social Me- dia (ICWSM), pages 830–839. AAAI, 2020. 4

  17. [17]

    Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale

    Marzieh Bitaab, Haehyun Cho, Adam Oest, Zhuoer Lyu, Wei Wang, Jorij Abraham, Ruoyu Wang, Tiffany Bao, Yan Shoshi- taishvili, and Adam Doupé. Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale. In IEEE Sympo- sium on Security and Privacy (S&P), pages 2566–2583. IEEE,

  18. [18]

    Bad Characters: Imperceptible NLP Attacks

    Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nico- las Papernot. Bad Characters: Imperceptible NLP Attacks. In IEEE Symposium on Security and Privacy (S&P) , pages 1987–2004. IEEE, 2022. 13

  19. [19]

    Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel

    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Lan- guage Models. In USENIX Security Symposium (USENIX Se- curity), pages 2633–2650. USENIX, 2021. 13

  20. [20]

    OPWNAI : Cybercriminals Starting To Use ChatGPT

    Checkpoint. OPWNAI : Cybercriminals Starting To Use ChatGPT. https://research.checkpoint.com/ 2023/opwnai-cybercriminals-starting-to-use- chatgpt/#single-post, April 2023. 1

  21. [21]

    BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements

    Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements. In Annual Computer Security Applications Conference (ACSAC), pages 554–569. ACSAC, 2021. 13

  22. [22]

    PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English

    Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, and Kai-Wei Chang. PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English. In Annual Meeting of the Association for Computational Linguistics (ACL) , pages 352–365. ACL, 2023. 3

  23. [23]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 3, 8, 17

  24. [24]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Annual Conference on Neural Information Processing Systems (NIPS), pages 4299–

  25. [25]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. 3, 8, 17

  26. [26]

    Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots. CoRR abs/2307.08715, 2023. 12

  27. [27]

    BreakGPT

    Discord. BreakGPT. https://disboard.org/server/ 1090300946568986810. 1

  28. [28]

    GLM: General Language Model Pretraining with Autoregressive Blank Infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. InAn- nual Meeting of the Association for Computational Linguistics (ACL), pages 320–335. ACL, 2022. 17

  29. [29]

    Fleiss’ kappa statistic without paradoxes

    Rosa Falotico and Piero Quatto. Fleiss’ kappa statistic without paradoxes. Quality & Quantity, 2015. 5 14

  30. [30]

    The Impact of Chat- GPT on Streaming Media: A Crowdsourced and Data-Driven Analysis using Twitter and Reddit

    Yunhe Feng, Pradhyumna Poralla, Swagatika Dash, Kaicheng Li, Vrushabh Desai, and Meikang Qiu. The Impact of Chat- GPT on Streaming Media: A Crowdsourced and Data-Driven Analysis using Twitter and Reddit. In IEEE International Conference on Big Data Security on Cloud, High Performance and Smart Computing and Intelligent Data and Security (Big- DataSecurity...

  31. [31]

    Paraphrase a text

    FlowGPT. Paraphrase a text. https://flowgpt.com/p/ paraphrase-a-text. 11

  32. [32]

    AI ACROSS GOOGLE: PaLM 2

    Google. AI ACROSS GOOGLE: PaLM 2. https://ai. google/discover/palm2/. 1, 8

  33. [33]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injec- tion Threats to Application-Integrated Large Language Mod- els. CoRR abs/2302.12173, 2023. 13

  34. [34]

    Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns

    Julian Hazell. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns. CoRR abs/2305.06972, 2023. 1, 3, 13

  35. [35]

    MGTBench: Benchmarking Machine-Generated Text Detection

    Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. MGTBench: Benchmarking Machine-Generated Text Detection. CoRR abs/2303.14822, 2023. 13

  36. [36]

    Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

    Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettle- moyer. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. In Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (NAACL- HLT), pages 1875–1885. ACL, 2018. 13

  37. [37]

    Is ChatGPT A Good Translator? A Prelim- inary Study

    Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is ChatGPT A Good Translator? A Prelim- inary Study. CoRR abs/2301.08745, 2023. 3

  38. [38]

    Perspective API

    Jigsaw. Perspective API. https://www.perspectiveapi. com. 9

  39. [39]

    Is BERT Really Robust? A Strong Baseline for Natural Lan- guage Attack on Text Classification and Entailment

    Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Lan- guage Attack on Text Classification and Entailment. In AAAI Conference on Artificial Intelligence (AAAI) , pages 8018–

  40. [40]

    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks. CoRR abs/2302.05733, 2023. 1, 2, 3, 13

  41. [41]

    On the Reli- ability of Watermarks for Large Language Models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the Reli- ability of Watermarks for Large Language Models. CoRR abs/2306.04634, 2023. 11

  42. [42]

    Multi-step Jailbreaking Privacy Attacks on ChatGPT

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 12, 13

  43. [43]

    PalmTree: Learning an Assembly Language Model for Instruction Embedding

    Xuezixiang Li, Yu Qu, and Heng Yin. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 3236–3251. ACM, 2021. 3

  44. [44]

    Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services

    Zilong Lin, Jian Cui, Xiaojing Liao, and XiaoFeng Wang. Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services. CoRR abs/2401.03315, 2024. 2, 13

  45. [45]

    Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Nat- ural Language Processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hi- roaki Hayashi, and Graham Neubig. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Nat- ural Language Processing. ACM Computing Surveys , 2023. 3

  46. [46]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023. 12

  47. [47]

    Analyzing Leak- age of Personally Identifiable Information in Language Mod- els

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella Béguelin. Analyzing Leak- age of Personally Identifiable Information in Language Mod- els. In IEEE Symposium on Security and Privacy (S&P), pages 346–363. IEEE, 2023. 13

  48. [48]

    A Holistic Approach to Undesired Content Detection in the Real World

    Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloun- dou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A Holistic Approach to Undesired Content Detection in the Real World. CoRR abs/208.03274, 2022. 1, 2, 12

  49. [49]

    UMAP: Uniform Manifold Approximation and Projection

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software, 2018. 5

  50. [50]

    Generalized Louvain method for com- munity detection in large networks

    Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. Generalized Louvain method for com- munity detection in large networks. In International Confer- ence on Intelligent Systems Design and Applications (ISDA) , pages 88–93. IEEE, 2011. 6, 17

  51. [51]

    Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064. ACL, 2022. 18

  52. [52]

    Barbosa, Olivia Figueira, Yang Wang, and Gang Wang

    Jaron Mink, Licheng Luo, Natã M. Barbosa, Olivia Figueira, Yang Wang, and Gang Wang. DeepPhish: Understanding User Trust Towards Artificially Generated Profiles in Online Social Networks. In USENIX Security Symposium (USENIX Security), pages 1669–1686. USENIX, 2022. 13

  53. [53]

    Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks

    Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks. In Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 8332–8347. ACL, 2022. 13

  54. [54]

    And We Will Fight For Our Race!

    Alexandros Mittos, Savvas Zannettou, Jeremy Blackburn, and Emiliano De Cristofaro. “And We Will Fight For Our Race!” A Measurement Study of Genetic Testing Conversations on Reddit and 4chan. In International Conference on Web and Social Media (ICWSM), pages 452–463. AAAI, 2020. 3

  55. [55]

    Mark E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences ,

  [56] NIST. AI Risk Management Framework. https://www.nist.gov/itl/ai-risk-management-framework.

  [57] NVIDIA. NeMo-Guardrails. https://github.com/NVIDIA/NeMo-Guardrails.

  [58] OpenAI. ChatGPT can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak.

  [59] OpenAI. Function calling and other API updates. https://openai.com/blog/function-calling-and-other-api-updates.

  [60] OpenAI. Moderation Endpoint. https://platform.openai.com/docs/guides/moderation/overview.

  [61] OpenAI. New models and developer products announced at DevDay. https://openai.com/blog/new-models-and-developer-products-announced-at-devday.

  [62] OpenAI. Pricing. https://openai.com/pricing.

  [63] OpenAI. Usage policies. https://openai.com/policies/usage-policies.

  [64] OpenAI. GPT-4 Technical Report. CoRR abs/2303.08774.

  [65] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  [66] Alessandro Pegoraro, Kavita Kumari, Hossein Fereidooni, and Ahmad-Reza Sadeghi. To ChatGPT, or not to ChatGPT: That is the question! CoRR abs/2304.01487, 2023.

  [67] Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can Large Language Models Reason about Program Invariants? In International Conference on Machine Learning (ICML). JMLR, 2023.

  [68] Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. CoRR abs/2211.09527.

  [69] Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2023.

  [70] Reddit. r/ChatGPTJailbreak. https://www.reddit.com/r/ChatGPTJailbreak/.

  [71] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990. ACL, 2019.

  [72] Marco Túlio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 4902–4912. ACL, 2020.

  [73] Caitlin M. Rivers and Bryan L. Lewis. Ethical research standards in a world of big data. F1000Research, 2014.

  [74] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 4454–4470. ACL, 2023.

  [75] Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT. CoRR abs/2304.08979, 2023.

  [76] Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2659–2673. ACM, 2022.

  [77] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize from human feedback. CoRR abs/2009.01325, 2020.

  [78] The White House. Blueprint for an AI Bill of Rights. https://www.whitehouse.gov/ostp/ai-bill-of-rights/.

  [79] Jörg Tiedemann and Santhosh Thottingal. OPUS-MT - Building open translation services for the World. In Conference of the European Association for Machine Translation (EAMT), pages 479–480. European Association for Machine Translation, 2020.

  [80] Together. OpenChatKit. https://github.com/togethercomputer/OpenChatKit.

Showing first 80 references.