pith. machine review for the scientific record.

arxiv: 2308.03825 · v2 · submitted 2023-08-07 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links · Lean Theorem

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 08:29 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords jailbreak, prompts, llms, attack, prompt, communities, days, december

The pith

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This study examines prompts that users create to make large language models ignore their built-in rules against generating harmful content. Researchers built a framework called JailbreakHub to gather 1,405 such prompts from December 2022 to December 2023, finding 131 different online communities where these prompts are shared. They noticed that the prompts often rely on tricks like pretending the model is in a different role or escalating access privileges. The team also built a large test set of 107,250 questions covering 13 types of forbidden topics. When they tested these prompts on six popular LLMs, the safety mechanisms failed to block harmful outputs in many cases. Five specific prompts stood out as especially effective, succeeding 95 percent of the time on both GPT-3.5 and GPT-4. One of these prompts had been circulating online for more than 240 days. The work shows that jailbreak techniques are evolving and moving to new sharing platforms.
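The evaluation described above can be pictured as a small harness that crosses each collected jailbreak prompt with each forbidden question and records whether the model refuses. A minimal sketch, assuming a hypothetical `query_model` callable standing in for any chat-completion API; the function names and refusal keywords here are illustrative, not from the paper:

```python
# Illustrative sketch of a jailbreak evaluation harness (names hypothetical).
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't", "as an ai"]

def is_refusal(response: str) -> bool:
    """Crude keyword check: does the response look like a refusal?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def evaluate(prompts, questions, query_model):
    """Cross every jailbreak prompt with every forbidden question;
    count non-refusals as successful attacks and report a per-prompt rate."""
    results = {}
    for p in prompts:
        successes = 0
        for q in questions:
            # The jailbreak prompt is prepended to the forbidden question.
            answer = query_model(p + "\n\n" + q)
            if not is_refusal(answer):
                successes += 1
        results[p] = successes / len(questions)  # per-prompt attack success rate
    return results
```

In practice keyword matching alone misclassifies partial refusals, which is presumably why the authors pair it with manual review of edge cases.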

Core claim

our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4

Load-bearing premise

The 1,405 collected prompts and the 107,250-question set across 13 scenarios are representative enough to support broad conclusions about the inadequacy of safeguards on all LLMs.

read the original abstract

The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the JailbreakHub framework to analyze 1,405 jailbreak prompts collected from December 2022 to December 2023 across 131 communities. It identifies characteristics, attack strategies such as prompt injection and privilege escalation, and trends like shifts to prompt-aggregation sites and persistent optimization by 28 accounts. The authors construct a dataset of 107,250 questions spanning 13 forbidden scenarios and evaluate six popular LLMs, reporting that five jailbreak prompts achieve 0.95 attack success rates on GPT-3.5 and GPT-4, concluding that current safeguards cannot adequately defend against jailbreaks in all scenarios.

Significance. If the collected prompts and scenarios prove representative, this work offers a valuable large-scale empirical characterization of real-world jailbreak techniques and a reusable question dataset that can drive improvements in LLM safety. The longitudinal tracking of prompt persistence (e.g., one effective prompt lasting over 240 days) and the scale of testing across multiple models provide concrete evidence of safeguard limitations that is useful for both researchers and vendors.

major comments (2)
  1. The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.
  2. In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.
minor comments (2)
  1. The abstract introduces JailbreakHub without a one-sentence overview of its main components or data pipeline, which would help readers quickly grasp the contribution.
  2. Ensure all figures and tables include clear captions explaining the 13 scenarios and how attack success is measured.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our manuscript. We address each major comment point-by-point below and indicate the revisions we will make in the next version of the paper.

read point-by-point responses
  1. Referee: The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.

    Authors: We appreciate the referee's point on the need for explicit methodological transparency to support claims of representativeness. The full manuscript describes identifying 131 communities through systematic searches across platforms including Reddit, Discord, and specialized forums, followed by collection of prompts that explicitly attempt to bypass LLM safeguards. The 13 forbidden scenarios were derived from prohibited categories in the usage policies of major LLM providers (e.g., OpenAI, Anthropic) and aligned with prior safety evaluation benchmarks. However, we acknowledge that a more detailed account of sampling methodology, precise exclusion criteria (such as requiring clear jailbreak intent and excluding duplicates or non-functional prompts), and validation steps for the scenarios would help readers assess potential selection bias. We will add a dedicated subsection to the Methods section outlining these processes, including any acknowledged limitations on generalizability. This revision will directly address the concern. revision: yes

  2. Referee: In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.

    Authors: We agree that precise experimental details are necessary to substantiate the quantitative results and the broader claim. In the manuscript, attack success rate (ASR) is defined as the proportion of model responses that successfully produce the requested harmful or forbidden content without issuing a refusal. To mitigate stochasticity, we ran three independent trials for each prompt-scenario combination using consistent prompt formatting and reported the averaged ASR. Response evaluation combined automated detection of common refusal keywords and phrases with manual review of edge cases. We will expand the Experimental Evaluation section to explicitly document the ASR definition, trial counts, parsing methodology, and controls for variability (such as noting default temperature settings and refusal patterns). These additions will provide the requested rigor without altering the reported findings. revision: yes
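The ASR protocol the rebuttal describes (three independent trials per prompt-question pair, keyword-based refusal detection, averaged to smooth stochasticity) can be sketched as follows. This is a reading of the rebuttal, not the paper's released code, and the refusal keywords are illustrative:

```python
# Sketch of the averaged ASR from the rebuttal (illustrative, not the paper's code).
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't assist")

def refused(response: str) -> bool:
    """Keyword-based refusal detection, as described in the rebuttal."""
    return any(k in response.lower() for k in REFUSAL_KEYWORDS)

def attack_success_rate(query_model, prompt, question, trials=3):
    """ASR = fraction of trials in which the model answers without refusing,
    averaged over independent trials to control for sampling stochasticity."""
    hits = sum(not refused(query_model(prompt, question)) for _ in range(trials))
    return hits / trials
```

Under this definition a single prompt-question pair yields an ASR in {0, 1/3, 2/3, 1}, so the headline 0.95 figure is an average across many such pairs.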

Circularity Check

0 steps flagged

No circularity: purely empirical collection and measurement study

full rationale

The paper performs data collection of 1,405 jailbreak prompts from 131 external online communities over a one-year period, constructs an independent question set of 107,250 samples across 13 scenarios, and reports direct attack-success measurements on six public LLMs. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the derivation chain; all reported results (including the 0.95 ASR on five prompts) are obtained by applying the collected prompts to external models and counting observed successes. The evaluation is therefore grounded entirely in external measurements, with no step in which outputs feed back into the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study is observational and relies on the assumption that sampled online prompts and chosen forbidden scenarios adequately represent the space of possible attacks; no new theoretical entities or fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption Jailbreak prompts can be reliably identified and categorized from public online sources.
Implicit in the collection of 1,405 prompts and identification of 131 communities.
  • domain assumption Attack success rate measured on a fixed question set reflects real-world harm potential.
    Used to support the claim that safeguards are inadequate.

pith-pipeline@v0.9.0 · 5563 in / 1367 out tokens · 91832 ms · 2026-05-17T08:29:12.747903+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  2. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  3. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    cs.CL 2023-10 conditional novelty 7.0

    Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

  4. Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.

  5. On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

    cs.CR 2026-05 conditional novelty 6.0

    An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

  6. Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

    cs.CR 2026-05 unverdicted novelty 6.0

    Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.

  7. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  8. Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

    cs.CR 2026-04 unverdicted novelty 6.0

    Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.

  9. Exclusive Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.

  10. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  11. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  12. A StrongREJECT for Empty Jailbreaks

    cs.LG 2024-02 conditional novelty 6.0

    StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

  13. Low-Resource Languages Jailbreak GPT-4

    cs.CL 2023-10 conditional novelty 6.0

    Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

  14. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  15. FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

    cs.CR 2026-04 unverdicted novelty 5.0

    FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.

  16. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

  17. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  18. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 18 Pith papers · 8 internal anchors

  1. [1]

    https: //assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/ 1146542/a_pro-innovation_approach_to_AI_ regulation.pdf

    A pro-innovation approach to AI regulation. https: //assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/ 1146542/a_pro-innovation_approach_to_AI_ regulation.pdf. 1, 3

  2. [2]

    https://www.aiprm.com/

    AIPRM. https://www.aiprm.com/. 4

  3. [3]

    https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts

    Awesome ChatGPT Prompts. https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts. 4

  4. [4]

    https://chat.openai.com/chat

    ChatGPT. https://chat.openai.com/chat. 1, 3, 8, 17

  5. [5]

    https://disboard.org/

    Disboard. https://disboard.org/. 4

  6. [6]

    https://en.wikipedia.org/wiki/Discord

    Discord. https://en.wikipedia.org/wiki/Discord. 4

  7. [7]

    https://flowgpt.com/

    FlowGPT. https://flowgpt.com/. 4

  8. [8]

    https:// gdpr-info.eu/

    General Data Protection Regulation (GDPR). https:// gdpr-info.eu/. 3

  9. [9]

    https://www.jailbreakchat.com

    JailbreakChat. https://www.jailbreakchat.com. 1, 4

  10. [10]

    http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm

    Measures for the Management of Generative Artificial Intelli- gence Services. http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm. 1, 3

  11. [11]

    https:// artificialintelligenceact.eu/

    The Artificial Intelligence Act. https:// artificialintelligenceact.eu/. 1, 3, 13

  12. [12]

    Open-Source Large Language Models Out- perform Crowd Workers and Approach ChatGPT in Text- Annotation Tasks

    Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin De- hghani, Juan Diego Bermeo, Maria Korobeynikova, and Fab- rizio Gilardi. Open-Source Large Language Models Out- perform Crowd Workers and Approach ChatGPT in Text- Annotation Tasks. CoRR abs/2307.02179, 2023. 17

  13. [13]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Ke- fan Xiao, Yuanzhong Xu, Yuji...

  14. [14]

    Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures. In IEEE Symposium on Security and Privacy (S&P), pages 769–786. IEEE, 2022. 13

  15. [15]

    A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V . Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of Chat- GPT on Reasoning, Hallucination, and Interactivity. CoRR abs/2302.04023, 2023. 3

  16. [16]

    The Pushshift Reddit Dataset

    Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The Pushshift Reddit Dataset. In International Conference on Web and Social Me- dia (ICWSM), pages 830–839. AAAI, 2020. 4

  17. [17]

    Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale

    Marzieh Bitaab, Haehyun Cho, Adam Oest, Zhuoer Lyu, Wei Wang, Jorij Abraham, Ruoyu Wang, Tiffany Bao, Yan Shoshi- taishvili, and Adam Doupé. Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale. In IEEE Sympo- sium on Security and Privacy (S&P), pages 2566–2583. IEEE,

  18. [18]

    Bad Characters: Imperceptible NLP Attacks

    Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nico- las Papernot. Bad Characters: Imperceptible NLP Attacks. In IEEE Symposium on Security and Privacy (S&P) , pages 1987–2004. IEEE, 2022. 13

  19. [19]

    Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel

    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Lan- guage Models. In USENIX Security Symposium (USENIX Se- curity), pages 2633–2650. USENIX, 2021. 13

  20. [20]

    OPWNAI : Cybercriminals Starting To Use ChatGPT

    Checkpoint. OPWNAI : Cybercriminals Starting To Use ChatGPT. https://research.checkpoint.com/ 2023/opwnai-cybercriminals-starting-to-use- chatgpt/#single-post, April 2023. 1

  21. [21]

    BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements

    Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements. In Annual Computer Security Applications Conference (ACSAC), pages 554–569. ACSAC, 2021. 13

  22. [22]

    PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English

    Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, and Kai-Wei Chang. PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English. In Annual Meeting of the Association for Computational Linguistics (ACL) , pages 352–365. ACL, 2023. 3

  23. [23]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 3, 8, 17

  24. [24]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Annual Conference on Neural Information Processing Systems (NIPS), pages 4299–

  25. [25]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. 3, 8, 17

  26. [26]

    Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots. CoRR abs/2307.08715, 2023. 12

  27. [27]

    BreakGPT

    Discord. BreakGPT. https://disboard.org/server/ 1090300946568986810. 1

  28. [28]

    GLM: General Language Model Pretraining with Autoregressive Blank Infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. InAn- nual Meeting of the Association for Computational Linguistics (ACL), pages 320–335. ACL, 2022. 17

  29. [29]

    Fleiss’ kappa statistic without paradoxes

    Rosa Falotico and Piero Quatto. Fleiss’ kappa statistic without paradoxes. Quality & Quantity, 2015. 5 14

  30. [30]

    The Impact of Chat- GPT on Streaming Media: A Crowdsourced and Data-Driven Analysis using Twitter and Reddit

    Yunhe Feng, Pradhyumna Poralla, Swagatika Dash, Kaicheng Li, Vrushabh Desai, and Meikang Qiu. The Impact of Chat- GPT on Streaming Media: A Crowdsourced and Data-Driven Analysis using Twitter and Reddit. In IEEE International Conference on Big Data Security on Cloud, High Performance and Smart Computing and Intelligent Data and Security (Big- DataSecurity...

  31. [31]

    Paraphrase a text

    FlowGPT. Paraphrase a text. https://flowgpt.com/p/ paraphrase-a-text. 11

  32. [32]

    AI ACROSS GOOGLE: PaLM 2

    Google. AI ACROSS GOOGLE: PaLM 2. https://ai. google/discover/palm2/. 1, 8

  33. [33]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injec- tion Threats to Application-Integrated Large Language Mod- els. CoRR abs/2302.12173, 2023. 13

  34. [34]

    Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns

    Julian Hazell. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns. CoRR abs/2305.06972, 2023. 1, 3, 13

  35. [35]

    MGTBench: Benchmarking Machine-Generated Text Detection

    Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. MGTBench: Benchmarking Machine-Generated Text Detection. CoRR abs/2303.14822, 2023. 13

  36. [36]

    Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

    Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettle- moyer. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. In Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (NAACL- HLT), pages 1875–1885. ACL, 2018. 13

  37. [37]

    Is ChatGPT A Good Translator? A Prelim- inary Study

    Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is ChatGPT A Good Translator? A Prelim- inary Study. CoRR abs/2301.08745, 2023. 3

  38. [38]

    Perspective API

    Jigsaw. Perspective API. https://www.perspectiveapi. com. 9

  39. [39]

    Is BERT Really Robust? A Strong Baseline for Natural Lan- guage Attack on Text Classification and Entailment

    Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Lan- guage Attack on Text Classification and Entailment. In AAAI Conference on Artificial Intelligence (AAAI) , pages 8018–

  40. [40]

    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks. CoRR abs/2302.05733, 2023. 1, 2, 3, 13

  41. [41]

    On the Reli- ability of Watermarks for Large Language Models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the Reli- ability of Watermarks for Large Language Models. CoRR abs/2306.04634, 2023. 11

  42. [42]

    Multi-step Jailbreaking Privacy Attacks on ChatGPT

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 12, 13

  43. [43]

    PalmTree: Learning an Assembly Language Model for Instruction Embedding

    Xuezixiang Li, Yu Qu, and Heng Yin. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 3236–3251. ACM, 2021. 3

  44. [44]

    Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services

    Zilong Lin, Jian Cui, Xiaojing Liao, and XiaoFeng Wang. Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services. CoRR abs/2401.03315, 2024. 2, 13

  45. [45]

    Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Nat- ural Language Processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hi- roaki Hayashi, and Graham Neubig. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Nat- ural Language Processing. ACM Computing Surveys , 2023. 3

  46. [46]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023. 12

  47. [47]

    Analyzing Leak- age of Personally Identifiable Information in Language Mod- els

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella Béguelin. Analyzing Leak- age of Personally Identifiable Information in Language Mod- els. In IEEE Symposium on Security and Privacy (S&P), pages 346–363. IEEE, 2023. 13

  48. [48]

    A Holistic Approach to Undesired Content Detection in the Real World

    Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloun- dou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A Holistic Approach to Undesired Content Detection in the Real World. CoRR abs/208.03274, 2022. 1, 2, 12

  49. [49]

    UMAP: Uniform Manifold Approximation and Projection

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software, 2018. 5

  50. [50]

    Generalized Louvain method for com- munity detection in large networks

    Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. Generalized Louvain method for com- munity detection in large networks. In International Confer- ence on Intelligent Systems Design and Applications (ISDA) , pages 88–93. IEEE, 2011. 6, 17

  51. [51]

    Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064. ACL, 2022. 18

  52. [52]

    Barbosa, Olivia Figueira, Yang Wang, and Gang Wang

    Jaron Mink, Licheng Luo, Natã M. Barbosa, Olivia Figueira, Yang Wang, and Gang Wang. DeepPhish: Understanding User Trust Towards Artificially Generated Profiles in Online Social Networks. In USENIX Security Symposium (USENIX Security), pages 1669–1686. USENIX, 2022. 13

  53. [53]

    Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks

    Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks. In Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 8332–8347. ACL, 2022. 13

  54. [54]

    And We Will Fight For Our Race!

    Alexandros Mittos, Savvas Zannettou, Jeremy Blackburn, and Emiliano De Cristofaro. “And We Will Fight For Our Race!” A Measurement Study of Genetic Testing Conversations on Reddit and 4chan. In International Conference on Web and Social Media (ICWSM), pages 452–463. AAAI, 2020. 3

  55. [55]

    Mark E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences ,

  [56] NIST. AI Risk Management Framework. https://www.nist.gov/itl/ai-risk-management-framework.

  [57] NVIDIA. NeMo-Guardrails. https://github.com/NVIDIA/NeMo-Guardrails.

  [58] OpenAI. ChatGPT can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak.

  [59] OpenAI. Function calling and other API updates. https://openai.com/blog/function-calling-and-other-api-updates.

  [60] OpenAI. Moderation Endpoint. https://platform.openai.com/docs/guides/moderation/overview.

  [61] OpenAI. New models and developer products announced at DevDay. https://openai.com/blog/new-models-and-developer-products-announced-at-devday.

  [62] OpenAI. Pricing. https://openai.com/pricing.

  [63] OpenAI. Usage policies. https://openai.com/policies/usage-policies.

  [64] OpenAI. GPT-4 Technical Report. CoRR abs/2303.08774.

  [65] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...

  [66] Alessandro Pegoraro, Kavita Kumari, Hossein Fereidooni, and Ahmad-Reza Sadeghi. To ChatGPT, or not to ChatGPT: That is the question! CoRR abs/2304.01487, 2023.

  [67] Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can Large Language Models Reason about Program Invariants? In International Conference on Machine Learning (ICML). JMLR, 2023.

  [68] Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. CoRR abs/2211.09527.

  [69] Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2023.

  [70] Reddit. r/ChatGPTJailbreak. https://www.reddit.com/r/ChatGPTJailbreak/.

  [71] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990. ACL, 2019.

  [72] Marco Túlio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 4902–4912. ACL, 2020.

  [73] Caitlin M. Rivers and Bryan L. Lewis. Ethical research standards in a world of big data. F1000Research, 2014.

  [74] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 4454–4470. ACL, 2023.

  [75] Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT. CoRR abs/2304.08979, 2023.

  [76] Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 2659–2673. ACM, 2022.

  [77] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize from human feedback. CoRR abs/2009.01325, 2020.

  [78] The White House. Blueprint for an AI Bill of Rights. https://www.whitehouse.gov/ostp/ai-bill-of-rights/.

  [79] Jörg Tiedemann and Santhosh Thottingal. OPUS-MT - Building open translation services for the World. In Conference of the European Association for Machine Translation (EAMT), pages 479–480. European Association for Machine Translation, 2020.

  [80] Together. OpenChatKit. https://github.com/togethercomputer/OpenChatKit.

Showing first 80 references.