Recognition: 2 theorem links
· Lean Theorem"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Pith reviewed 2026-05-17 08:29 UTC · model grok-4.3
The pith
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4
Load-bearing premise
The 1,405 collected prompts and the 107,250-question set across 13 scenarios are representative enough to support broad conclusions about the inadequacy of safeguards on all LLMs.
read the original abstract
The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the JailbreakHub framework to analyze 1,405 jailbreak prompts collected from December 2022 to December 2023 across 131 communities. It identifies characteristics, attack strategies such as prompt injection and privilege escalation, and trends like shifts to prompt-aggregation sites and persistent optimization by 28 accounts. The authors construct a dataset of 107,250 questions spanning 13 forbidden scenarios and evaluate six popular LLMs, reporting that five jailbreak prompts achieve 0.95 attack success rates on GPT-3.5 and GPT-4, concluding that current safeguards cannot adequately defend against jailbreaks in all scenarios.
Significance. If the collected prompts and scenarios prove representative, this work offers a valuable large-scale empirical characterization of real-world jailbreak techniques and a reusable question dataset that can drive improvements in LLM safety. The longitudinal tracking of prompt persistence (e.g., one effective prompt lasting over 240 days) and the scale of testing across multiple models provide concrete evidence of safeguard limitations that is useful for both researchers and vendors.
major comments (2)
- The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.
- In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.
minor comments (2)
- The abstract introduces JailbreakHub without a one-sentence overview of its main components or data pipeline, which would help readers quickly grasp the contribution.
- Ensure all figures and tables include clear captions explaining the 13 scenarios and how attack success is measured.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our manuscript. We address each major comment point-by-point below and indicate the revisions we will make in the next version of the paper.
read point-by-point responses
-
Referee: The central claim that safeguards 'cannot adequately defend jailbreak prompts in all scenarios' depends on the representativeness of the 1,405 prompts and 107,250-question set. The abstract and methods description provide no detail on sampling methodology from the 131 communities, exclusion criteria, or how the 13 scenarios were defined and validated, leaving open the possibility of selection bias that would weaken the generalization.
Authors: We appreciate the referee's point on the need for explicit methodological transparency to support claims of representativeness. The full manuscript describes identifying 131 communities through systematic searches across platforms including Reddit, Discord, and specialized forums, followed by collection of prompts that explicitly attempt to bypass LLM safeguards. The 13 forbidden scenarios were derived from prohibited categories in the usage policies of major LLM providers (e.g., OpenAI, Anthropic) and aligned with prior safety evaluation benchmarks. However, we acknowledge that a more detailed account of sampling methodology, precise exclusion criteria (such as requiring clear jailbreak intent and excluding duplicates or non-functional prompts), and validation steps for the scenarios would help readers assess potential selection bias. We will add a dedicated subsection to the Methods section outlining these processes, including any acknowledged limitations on generalizability. This revision will directly address the concern. revision: yes
-
Referee: In the experimental evaluation section, the reported 0.95 ASR for the top five prompts on GPT-3.5 and GPT-4 is a key quantitative result. The paper must specify the precise definition of attack success rate, the number of trials per prompt-scenario combination, response parsing rules, and any controls for stochasticity or refusal variability to support the claim of inadequacy across all scenarios.
Authors: We agree that precise experimental details are necessary to substantiate the quantitative results and the broader claim. In the manuscript, attack success rate (ASR) is defined as the proportion of model responses that successfully produce the requested harmful or forbidden content without issuing a refusal. To mitigate stochasticity, we ran three independent trials for each prompt-scenario combination using consistent prompt formatting and reported the averaged ASR. Response evaluation combined automated detection of common refusal keywords and phrases with manual review of edge cases. We will expand the Experimental Evaluation section to explicitly document the ASR definition, trial counts, parsing methodology, and controls for variability (such as noting default temperature settings and refusal patterns). These additions will provide the requested rigor without altering the reported findings. revision: yes
Circularity Check
No circularity: purely empirical collection and measurement study
full rationale
The paper performs data collection of 1,405 jailbreak prompts from 131 external online communities over a one-year period, constructs an independent question set of 107,250 samples across 13 scenarios, and reports direct attack-success measurements on six public LLMs. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the derivation chain; all reported results (including the 0.95 ASR on five prompts) are obtained by applying the collected prompts to external models and counting observed successes. The work is therefore self-contained against external benchmarks and contains no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Jailbreak prompts can be reliably identified and categorized from public online sources.
- domain assumption Attack success rate measured on a fixed question set reflects real-world harm potential.
Forward citations
Cited by 18 Pith papers
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
-
Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.
-
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
-
Exposing LLM Safety Gaps Through Mathematical Encoding:New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries
Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.
-
Exclusive Unlearning
Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.
-
Lessons from the Trenches on Reproducible Evaluation of Language Models
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
A StrongREJECT for Empty Jailbreaks
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
-
Low-Resource Languages Jailbreak GPT-4
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
-
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
-
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
A pro-innovation approach to AI regulation. https: //assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/ 1146542/a_pro-innovation_approach_to_AI_ regulation.pdf. 1, 3
- [2]
-
[3]
https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts
Awesome ChatGPT Prompts. https://huggingface.co/ datasets/fka/awesome-chatgpt-prompts. 4
- [4]
- [5]
- [6]
- [7]
-
[8]
General Data Protection Regulation (GDPR). https:// gdpr-info.eu/. 3
- [9]
-
[10]
http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm
Measures for the Management of Generative Artificial Intelli- gence Services. http://www.cac.gov.cn/2023-07/13/c_ 1690898327029107.htm. 1, 3
work page 2023
-
[11]
https:// artificialintelligenceact.eu/
The Artificial Intelligence Act. https:// artificialintelligenceact.eu/. 1, 3, 13
-
[12]
Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin De- hghani, Juan Diego Bermeo, Maria Korobeynikova, and Fab- rizio Gilardi. Open-Source Large Language Models Out- perform Crowd Workers and Approach ChatGPT in Text- Annotation Tasks. CoRR abs/2307.02179, 2023. 17
-
[13]
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Ke- fan Xiao, Yuanzhong Xu, Yuji...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures
Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Coun- termeasures. In IEEE Symposium on Security and Privacy (S&P), pages 769–786. IEEE, 2022. 13
work page 2022
-
[15]
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V . Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of Chat- GPT on Reasoning, Hallucination, and Interactivity. CoRR abs/2302.04023, 2023. 3
-
[16]
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. The Pushshift Reddit Dataset. In International Conference on Web and Social Me- dia (ICWSM), pages 830–839. AAAI, 2020. 4
work page 2020
-
[17]
Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale
Marzieh Bitaab, Haehyun Cho, Adam Oest, Zhuoer Lyu, Wei Wang, Jorij Abraham, Ruoyu Wang, Tiffany Bao, Yan Shoshi- taishvili, and Adam Doupé. Beyond Phish: Toward Detecting Fraudulent e-Commerce Websites at Scale. In IEEE Sympo- sium on Security and Privacy (S&P), pages 2566–2583. IEEE,
-
[18]
Bad Characters: Imperceptible NLP Attacks
Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nico- las Papernot. Bad Characters: Imperceptible NLP Attacks. In IEEE Symposium on Security and Privacy (S&P) , pages 1987–2004. IEEE, 2022. 13
work page 1987
-
[19]
Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Lan- guage Models. In USENIX Security Symposium (USENIX Se- curity), pages 2633–2650. USENIX, 2021. 13
work page 2021
-
[20]
OPWNAI : Cybercriminals Starting To Use ChatGPT
Checkpoint. OPWNAI : Cybercriminals Starting To Use ChatGPT. https://research.checkpoint.com/ 2023/opwnai-cybercriminals-starting-to-use- chatgpt/#single-post, April 2023. 1
work page 2023
-
[21]
BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements
Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements. In Annual Computer Security Applications Conference (ACSAC), pages 554–569. ACSAC, 2021. 13
work page 2021
-
[22]
PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English
Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, and Kai-Wei Chang. PLUE: Language Understanding Evaluation Bench- mark for Privacy Policies in English. In Annual Meeting of the Association for Computational Linguistics (ACL) , pages 352–365. ACL, 2023. 3
work page 2023
-
[23]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 3, 8, 17
work page 2023
-
[24]
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In Annual Conference on Neural Information Processing Systems (NIPS), pages 4299–
-
[25]
Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023
Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. 3, 8, 17
work page 2023
-
[26]
Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots. CoRR abs/2307.08715, 2023. 12
- [27]
-
[28]
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. InAn- nual Meeting of the Association for Computational Linguistics (ACL), pages 320–335. ACL, 2022. 17
work page 2022
-
[29]
Fleiss’ kappa statistic without paradoxes
Rosa Falotico and Piero Quatto. Fleiss’ kappa statistic without paradoxes. Quality & Quantity, 2015. 5 14
work page 2015
-
[30]
Yunhe Feng, Pradhyumna Poralla, Swagatika Dash, Kaicheng Li, Vrushabh Desai, and Meikang Qiu. The Impact of Chat- GPT on Streaming Media: A Crowdsourced and Data-Driven Analysis using Twitter and Reddit. In IEEE International Conference on Big Data Security on Cloud, High Performance and Smart Computing and Intelligent Data and Security (Big- DataSecurity...
work page 2023
-
[31]
FlowGPT. Paraphrase a text. https://flowgpt.com/p/ paraphrase-a-text. 11
-
[32]
Google. AI ACROSS GOOGLE: PaLM 2. https://ai. google/discover/palm2/. 1, 8
-
[33]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injec- tion Threats to Application-Integrated Large Language Mod- els. CoRR abs/2302.12173, 2023. 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns
Julian Hazell. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns. CoRR abs/2305.06972, 2023. 1, 3, 13
-
[35]
MGTBench: Benchmarking Machine-Generated Text Detection
Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. MGTBench: Benchmarking Machine-Generated Text Detection. CoRR abs/2303.14822, 2023. 13
-
[36]
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettle- moyer. Adversarial Example Generation with Syntactically Controlled Paraphrase Networks. In Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (NAACL- HLT), pages 1875–1885. ACL, 2018. 13
work page 2018
-
[37]
Is ChatGPT A Good Translator? A Prelim- inary Study
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is ChatGPT A Good Translator? A Prelim- inary Study. CoRR abs/2301.08745, 2023. 3
- [38]
-
[39]
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Lan- guage Attack on Text Classification and Entailment. In AAAI Conference on Artificial Intelligence (AAAI) , pages 8018–
-
[40]
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security At- tacks. CoRR abs/2302.05733, 2023. 1, 2, 3, 13
-
[41]
On the Reli- ability of Watermarks for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. On the Reli- ability of Watermarks for Large Language Models. CoRR abs/2306.04634, 2023. 11
-
[42]
Multi-step Jailbreaking Privacy Attacks on ChatGPT
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 12, 13
-
[43]
PalmTree: Learning an Assembly Language Model for Instruction Embedding
Xuezixiang Li, Yu Qu, and Heng Yin. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 3236–3251. ACM, 2021. 3
work page 2021
-
[44]
Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services
Zilong Lin, Jian Cui, Xiaojing Liao, and XiaoFeng Wang. Malla: Demystifying Real-world Large Language Model In- tegrated Malicious Services. CoRR abs/2401.03315, 2024. 2, 13
-
[45]
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hi- roaki Hayashi, and Graham Neubig. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Nat- ural Language Processing. ACM Computing Surveys , 2023. 3
work page 2023
-
[46]
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jail- breaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023. 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Analyzing Leak- age of Personally Identifiable Information in Language Mod- els
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella Béguelin. Analyzing Leak- age of Personally Identifiable Information in Language Mod- els. In IEEE Symposium on Security and Privacy (S&P), pages 346–363. IEEE, 2023. 13
work page 2023
-
[48]
A Holistic Approach to Undesired Content Detection in the Real World
Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloun- dou, Teddy Lee, Steven Adler, Angela Jiang, and Lilian Weng. A Holistic Approach to Undesired Content Detection in the Real World. CoRR abs/208.03274, 2022. 1, 2, 12
work page 2022
-
[49]
UMAP: Uniform Manifold Approximation and Projection
Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform Manifold Approximation and Projection. The Journal of Open Source Software, 2018. 5
work page 2018
-
[50]
Generalized Louvain method for com- munity detection in large networks
Pasquale De Meo, Emilio Ferrara, Giacomo Fiumara, and Alessandro Provetti. Generalized Louvain method for com- munity detection in large networks. In International Confer- ence on Intelligent Systems Design and Applications (ISDA) , pages 88–93. IEEE, 2011. 6, 17
work page 2011
-
[51]
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethink- ing the Role of Demonstrations: What Makes In-Context Learning Work? In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11048–11064. ACL, 2022. 18
work page 2022
-
[52]
Barbosa, Olivia Figueira, Yang Wang, and Gang Wang
Jaron Mink, Licheng Luo, Natã M. Barbosa, Olivia Figueira, Yang Wang, and Gang Wang. DeepPhish: Understanding User Trust Towards Artificially Generated Profiles in Online Social Networks. In USENIX Security Symposium (USENIX Security), pages 1669–1686. USENIX, 2022. 13
work page 2022
-
[53]
Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks
Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying Pri- vacy Risks of Masked Language Models Using Membership Inference Attacks. In Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages 8332–8347. ACL, 2022. 13
work page 2022
-
[54]
And We Will Fight For Our Race!
Alexandros Mittos, Savvas Zannettou, Jeremy Blackburn, and Emiliano De Cristofaro. “And We Will Fight For Our Race!” A Measurement Study of Genetic Testing Conversations on Reddit and 4chan. In International Conference on Web and Social Media (ICWSM), pages 452–463. AAAI, 2020. 3
work page 2020
-
[55]
Mark E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences ,
-
[56]
NIST. AI Risk Management Framework. https://www. nist.gov/itl/ai-risk-management-framework . 3, 13
-
[57]
NVIDIA. NeMo-Guardrails. https://github.com/ NVIDIA/NeMo-Guardrails. 1, 2, 12
-
[58]
ChatGPT can now see, hear, and speak
OpenAI. ChatGPT can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see- hear-and-speak. 5
-
[59]
Function calling and other API updates
OpenAI. Function calling and other API updates. https://openai.com/blog/function-calling-and- other-api-updates. 5
-
[60]
OpenAI. Moderation Endpoint. https://platform. openai.com/docs/guides/moderation/overview. 12 15
-
[61]
New models and developer products announced at DevDay
OpenAI. New models and developer products announced at DevDay. https://openai.com/blog/new-models-and- developer-products-announced-at-devday . 5
-
[62]
OpenAI. Pricing. https://openai.com/pricing. 5
-
[63]
OpenAI. Usage policies. https://openai.com/policies/ usage-policies. 2, 8, 20
-
[64]
OpenAI. GPT-4 Technical Report. CoRR abs/2303.08774 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human f...
work page 2022
-
[66]
To ChatGPT, or not to ChatGPT: That is the question! CoRR abs/2304.01487, 2023
Alessandro Pegoraro, Kavita Kumari, Hossein Fereidooni, and Ahmad-Reza Sadeghi. To ChatGPT, or not to ChatGPT: That is the question! CoRR abs/2304.01487, 2023. 3
-
[67]
Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. Can Large Language Models Reason about Program Invariants? In International Conference on Machine Learning (ICML). JMLR, 2023. 3
work page 2023
-
[68]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. CoRR abs/2211.09527 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Unsafe Diffusion: On the Gen- eration of Unsafe Images and Hateful Memes From Text-To- Image Models
Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe Diffusion: On the Gen- eration of Unsafe Images and Hateful Memes From Text-To- Image Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2023. 1, 3, 13
work page 2023
-
[70]
Reddit. r/ChatGPTJailbreak. https://www.reddit.com/r/ ChatGPTJailbreak/. 1, 3
-
[71]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Process- ing (EMNLP-IJCNLP), pages 3980–3990. ACL, 2019. 5
work page 2019
-
[72]
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Marco Túlio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Annual Meeting of the Associa- tion for Computational Linguistics (ACL) , pages 4902–4912. ACL, 2020. 11
work page 2020
-
[73]
Caitlin M. Rivers and Bryan L. Lewis. Ethical research stan- dards in a world of big data. F1000Research, 2014. 3
work page 2014
-
[74]
On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
Omar Shaikh, Hongxin Zhang, William Held, Michael Bern- stein, and Diyi Yang. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Annual Meeting of the Association for Computational Lin- guistics (ACL), pages 4454–4470. ACL, 2023. 8
work page 2023
-
[75]
In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT. CoRR abs/2304.08979, 2023. 13
-
[76]
Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots
Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, and Yang Zhang. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. In ACM SIGSAC Con- ference on Computer and Communications Security (CCS) , pages 2659–2673. ACM, 2022. 9
work page 2022
-
[77]
Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize from human feed- back. CoRR abs/2009.01325, 2020. 17
-
[78]
Blueprint for an AI Bill of Rights
The White House. Blueprint for an AI Bill of Rights. https: //www.whitehouse.gov/ostp/ai-bill-of-rights/ . 1, 3
-
[79]
OPUS-MT - Build- ing open translation services for the World
Jörg Tiedemann and Santhosh Thottingal. OPUS-MT - Build- ing open translation services for the World. In Conference of the European Association for Machine Translation (EAMT) , pages 479–480. European Association for Machine Transla- tion, 2020. 11
work page 2020
-
[80]
Together. OpenChatKit. https://github.com/ togethercomputer/OpenChatKit. 1, 2, 12
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.