GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3
The pith
Automated fuzzing of human-written jailbreak seeds produces templates that succeed against ChatGPT and Llama-2 at rates above 90 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPTFuzz starts with human-written jailbreak templates as seeds, selects them for mutation to balance efficiency and diversity, applies operators that rewrite sentences while preserving intent, and feeds the results to a judgment model that labels whether the target LLM produced the requested harmful content. On this basis the framework reports attack success rates exceeding 90 percent on ChatGPT and Llama-2 across multiple scenarios, outperforming the original hand-crafted seeds.
What carries the argument
The fuzzing loop that repeatedly selects seeds, applies semantic-preserving mutation operators, and filters results through a judgment model that scores whether harmful content was elicited.
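The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the UCB1 selection rule, the `[QUESTION]` placeholder, and the toy `toy_mutate`, `toy_target`, and `toy_judge` functions are all hypothetical stand-ins for the real seed selector, mutation operators, target LLM, and judgment model.

```python
import math
import random

def ucb_select(stats, total_rounds, c=1.0):
    """Pick a seed index by UCB1 score (an assumed selection rule)."""
    best, best_score = 0, float("-inf")
    for i, (pulls, wins) in enumerate(stats):
        if pulls == 0:
            return i  # try every seed at least once
        score = wins / pulls + c * math.sqrt(2 * math.log(total_rounds) / pulls)
        if score > best_score:
            best, best_score = i, score
    return best

def fuzz(seeds, mutate, query_target, judge, question, budget, rng):
    """Run a select -> mutate -> query -> judge loop for `budget` rounds."""
    pool = list(seeds)
    stats = [[0, 0] for _ in pool]  # [times selected, successes] per template
    successes = []
    for round_no in range(1, budget + 1):
        idx = ucb_select(stats, round_no)
        candidate = mutate(pool[idx], rng)
        response = query_target(candidate.replace("[QUESTION]", question))
        stats[idx][0] += 1
        if judge(response):
            stats[idx][1] += 1
            successes.append(candidate)
            pool.append(candidate)  # successful mutants become new seeds
            stats.append([0, 0])
    return successes

# Hypothetical toy components: a keyword trigger stands in for a real model.
def toy_mutate(template, rng):
    return template + " " + rng.choice(["kindly", "please", "briefly"])

def toy_target(prompt):
    return "COMPLIED" if prompt.count("please") >= 2 else "REFUSED"

def toy_judge(response):
    return response == "COMPLIED"

found = fuzz(["Answer [QUESTION] please"], toy_mutate, toy_target, toy_judge,
             "a benign test question", budget=50, rng=random.Random(0))
```

Because the judge is the only signal that feeds back into the seed pool, any bias in it compounds: a false success both inflates the count and promotes the mislabeled template to a seed.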
If this is right
- Red-teaming large language models can be scaled beyond the limits of manual prompt design.
- Safety evaluations of commercial and open-source models can be automated and repeated with diverse prompt variations.
- Even models that resist the original human-written templates remain vulnerable to small rephrasings produced by mutation.
- Jailbreak success can be achieved without access to model weights or internal states.
Where Pith is reading between the lines
- LLM safety training appears sensitive to prompt phrasing variations that preserve meaning, suggesting alignment may be shallower than intended.
- The same mutation-based generation approach could be applied to discover other failure modes such as factual hallucination or bias amplification.
- Developers of LLMs may need to incorporate automated red-teaming loops into their training or evaluation pipelines to close these gaps.
- Over-reliance on any single judgment model risks inflating perceived vulnerability unless cross-validated against human raters.
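One way to make the cross-validation point concrete is to measure chance-corrected agreement between the judgment model and a human rater. The sketch below computes Cohen's kappa in plain Python; the function name and the two label lists are illustrative assumptions, and the labels are synthetic rather than data from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Synthetic labels: 1 = "jailbreak succeeded", 0 = "refused or safe answer".
judge_labels = [1, 1, 1, 0, 0, 0, 1, 0]
human_labels = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(judge_labels, human_labels)
```

Raw agreement here is 6 of 8, but half of that is expected by chance given the label frequencies, which is why kappa is the more informative number when one class dominates.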
Load-bearing premise
The separate judgment model correctly labels whether each generated prompt actually succeeded in eliciting harmful output from the target LLM.
What would settle it
Manual inspection of several hundred GPTFuzz outputs and target responses that shows the judgment model misclassifies at least 15 percent of cases as successes when the LLM actually refused or gave safe answers.
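A check of this kind could be run roughly as follows. The sketch counts cases where the judge reported success but a human reviewer did not, then uses a Wilson score interval to ask whether the false-success rate is credibly above the 15 percent mark. The function names, the 95 percent confidence level, and the synthetic labels are assumptions made for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half

def count_false_successes(judge_labels, human_labels):
    """Count cases the judge marked as success that the human marked as not."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have equal length")
    return sum(1 for j, h in zip(judge_labels, human_labels) if j and not h)

# Synthetic inspection sample: 300 judge-reported successes, 60 rejected by humans.
judge_labels = [True] * 300
human_labels = [True] * 240 + [False] * 60

k = count_false_successes(judge_labels, human_labels)
rate = k / len(judge_labels)
low, high = wilson_interval(k, len(judge_labels))
exceeds_threshold = low > 0.15  # lower bound clears the 15 percent mark
```

Comparing the interval's lower bound, rather than the point estimate, to the threshold guards against declaring the judge unreliable on the strength of a small or unlucky sample.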
Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GPTFuzz, a black-box fuzzing framework inspired by AFL that automates jailbreak template generation for red-teaming LLMs. It starts with human-written seed templates, applies mutation operators to create variants, uses a seed selection strategy, and employs a judgment model to evaluate attack success. Experiments cover ChatGPT, Llama-2, and Vicuna; the reported attack success rates exceed 90% against ChatGPT and Llama-2, even from suboptimal seeds, and surpass those of manual templates.
Significance. If the results are robust, the work supplies a practical, scalable tool for automated red-teaming that could accelerate safety evaluations of LLMs and motivate stronger alignment techniques. The empirical demonstration across commercial and open-source models, combined with the mutation-based generation approach, offers a concrete advance over purely manual jailbreak crafting.
major comments (2)
- [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.
- [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.
minor comments (2)
- [Abstract] The abstract states that GPTFuzz surpasses human-crafted templates but does not quantify the human baseline ASR or describe the exact set of manual templates used for comparison.
- [Method] Notation for the mutation operators and the seed selection scoring function could be introduced more formally with pseudocode or equations to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
- Referee: [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.
  Authors: We agree that the accuracy and potential bias of the judgment model are critical to the validity of the reported attack success rates. The original manuscript described the judgment model but did not include calibration details. In the revised version, we will add a dedicated subsection that reports a calibration study: a random sample of responses will be independently labeled by multiple human annotators, inter-annotator agreement will be computed (Cohen's kappa for annotator pairs, or Fleiss' kappa across all annotators), and false-positive and false-negative rates of the judge relative to human consensus will be presented. This will directly address concerns about systematic bias. revision: yes
- Referee: [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.
  Authors: We acknowledge the absence of these controls and comparisons. In the revision we will add (1) direct comparisons against other automated jailbreak generation methods that were available at the time of submission, (2) results from multiple independent runs of GPTFuzz at different temperature settings, reporting means and standard deviations, and (3) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing GPTFuzz against both human-crafted templates and the added baselines. These additions will demonstrate robustness beyond single-run results. revision: yes
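The multi-run reporting promised in the second response could look roughly like the sketch below. It summarizes per-run attack success rates with a mean and sample standard deviation, then compares two paired conditions. Note the swap: instead of the paired t-test or Wilcoxon test named above, it uses a sign-flip permutation test, chosen here only because it needs nothing beyond the standard library. The function names are hypothetical and the ASR values are made-up placeholders, not results from the paper.

```python
import random
import statistics

def summarize(asr_runs):
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(asr_runs), statistics.stdev(asr_runs)

def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each paired difference is equally
        # likely to carry either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one correction avoids p = 0

# Placeholder ASR values for eight paired runs (same seeds and temperatures).
fuzzed_asr = [0.90, 0.92, 0.88, 0.91, 0.93, 0.89, 0.94, 0.90]
manual_asr = [0.60, 0.65, 0.58, 0.62, 0.66, 0.61, 0.63, 0.59]

mean_asr, std_asr = summarize(fuzzed_asr)
p_value = paired_permutation_test(fuzzed_asr, manual_asr)
```

Pairing runs by random seed and temperature setting keeps the comparison focused on the method rather than on sampling noise in the target model.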
Circularity Check
No significant circularity in empirical framework
The paper presents GPTFuzz as an empirical black-box fuzzing framework: human-written seed templates are mutated via operators and evaluated for jailbreak success using a separate judgment model against target LLMs (ChatGPT, Llama-2, Vicuna). No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Results (e.g., >90% ASR) are reported from direct evaluations on external models rather than reducing to inputs by construction. The framework relies on external APIs and human seeds, with no load-bearing self-citations or uniqueness claims that collapse the central result. This is a standard empirical methodology without circular reduction.
Forward citations
Cited by 22 Pith papers
- Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
  Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...
- PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
  Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
- On the Hardness of Junking LLMs
  Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
- ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
  ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
- The Power of Order: Fooling LLMs with Adversarial Table Permutations
  Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.
- Adaptive Instruction Composition for Automated LLM Red-Teaming
  Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.
- LLM-Agnostic Semantic Representation Attack
  SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
- PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
  PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...
- The Power of Order: Fooling LLMs with Adversarial Table Permutations
  Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
- FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
  FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
- HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
  HarDBench demonstrates that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and a safety-utility balanced alignment via preference optimization signi...
- TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
  TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
- The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
  Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
- Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
  Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
- Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
  ESI framework identifies architecture-specific safety-critical parameters in LLMs, enabling SET to reduce attack success rates by over 50% via 1% weight updates and SPA to limit safety loss to under 1% during instruct...
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
  CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
  TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
  JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
- SoK: Robustness in Large Language Models against Jailbreak Attacks
  The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
- PIArena: A Platform for Prompt Injection Evaluation
  PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Anthropic. Introducing claude. https://www.anthropic.com/ index/introducing-claude. Accessed on 08/08/2023
work page 2023
-
[3]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235– 256, 2002
work page 2002
-
[4]
Efficient greybox fuzzing to detect memory errors
Jinsheng Ba, Gregory J Duck, and Abhik Roychoudhury. Efficient greybox fuzzing to detect memory errors. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineer- ing, pages 1–12, 2022
work page 2022
-
[5]
Spinning language models: Risks of propaganda-as-a-service and countermeasures
Eugene Bagdasaryan and Vitaly Shmatikov. Spinning language models: Risks of propaganda-as-a-service and countermeasures. In Proc. of IEEE Symposium on Security and Privacy (SP). IEEE, 2022
work page 2022
-
[6]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom 16 Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Baichuan-Inc. Baichuan-13b. https://github.com/ baichuan-inc/Baichuan-13B. Accessed on 08/08/2023
work page 2023
-
[8]
A computer wrote this paper: What chatgpt means for education, research, and writing
Lea Bishop. A computer wrote this paper: What chatgpt means for education, research, and writing. Research, and Writing (January 26, 2023), 2023
work page 2023
-
[9]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020
work page 1901
-
[10]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
How is chatgpt’s behav- ior changing over time? arXiv preprint arXiv:2307.09009, 2023
Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behav- ior changing over time? arXiv preprint arXiv:2307.09009, 2023
-
[12]
Chatgpt and other artificial intelligence applications speed up scientific writing
Tzeng-Ji Chen. Chatgpt and other artificial intelligence applications speed up scientific writing. Journal of the Chinese Medical Association, 86(4):351–353, 2023
work page 2023
-
[13]
Efficient selectivity and backup operators in monte- carlo tree search
Rémi Coulom. Efficient selectivity and backup operators in monte- carlo tree search. Computers and games, 4630:72–83, 2006
work page 2006
-
[14]
Jailbreaker: Automated jailbreak across multiple large language model chatbots
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023
-
[15]
Toxicity in chatgpt: Analyzing persona-assigned language models
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023
-
[16]
Dynamic sym- bolic execution guided by data dependency analysis for high structural coverage
TheAnh Do, Alvis Cheuk M Fong, and Russel Pears. Dynamic sym- bolic execution guided by data dependency analysis for high structural coverage. In Evaluation of Novel Approaches to Software Engineer- ing: 7th International Conference, ENASE 2012, Warsaw, Poland, June 29-30, 2012, Revised Selected Papers 7, pages 3–15. Springer, 2013
work page 2012
-
[17]
Glm: General language model pretraining with autoregressive blank infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021
-
[18]
Snipuzz: Black-box fuzzing of iot firmware via message snippet inference
Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. Snipuzz: Black-box fuzzing of iot firmware via message snippet inference. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 337–350, 2021
work page 2021
-
[19]
Libafl: A framework to build modular and reusable fuzzers
Andrea Fioraldi, Dominik Christian Maier, Dongjia Zhang, and Davide Balzarotti. Libafl: A framework to build modular and reusable fuzzers. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 1051–1065, 2022
work page 2022
-
[20]
Dyta: dynamic symbolic execution guided with static verification results
Xi Ge, Kunal Taneja, Tao Xie, and Nikolai Tillmann. Dyta: dynamic symbolic execution guided with static verification results. In Proceed- ings of the 33rd International Conference on Software Engineering , pages 992–994, 2011
work page 2011
-
[21]
Realtoxicityprompts: Evaluating neural toxic degenera- tion in language models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degenera- tion in language models. arXiv preprint arXiv:2009.11462, 2020
-
[22]
Google. Bard. https://bard.google.com/. Accessed on 08/08/2023
work page 2023
-
[23]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Com- promising real-world llm-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
On cal- ibration of modern neural networks
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On cal- ibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017
work page 2017
-
[25]
Seed selection for success- ful fuzzing
Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L Hosking. Seed selection for success- ful fuzzing. In Proceedings of the 30th ACM SIGSOFT international symposium on software testing and analysis, pages 230–243, 2021
work page 2021
-
[26]
Balance seed scheduling via monte carlo planning
Heqing Huang, Hung-Chun Chiu, Qingkai Shi, Peisen Yao, and Charles Zhang. Balance seed scheduling via monte carlo planning. IEEE Transactions on Dependable and Secure Computing, 2023
work page 2023
-
[27]
Diar: Removing un- interesting bytes from seeds in software fuzzing
Aftab Hussain and Mohammad Amin Alipour. Diar: Removing un- interesting bytes from seeds in software fuzzing. arXiv preprint arXiv:2112.13297, 2021
-
[28]
Smart seed selection-based effective black box fuzzing for iiot protocol
SungJin Kim, Jaeik Cho, Changhoon Lee, and Taeshik Shon. Smart seed selection-based effective black box fuzzing for iiot protocol. The Journal of Supercomputing, 76:10140–10154, 2020
work page 2020
-
[29]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[30]
Learning seed-adaptive mutation strategies for greybox fuzzing
Myungho Lee, Sooyoung Cha, and Hakjoo Oh. Learning seed-adaptive mutation strategies for greybox fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 384–
work page 2023
-
[31]
Multi-step jailbreaking privacy attacks on chatgpt
Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023
-
[32]
Multi-target backdoor attacks for code pre-trained models
Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. Multi-target backdoor attacks for code pre-trained models. arXiv preprint arXiv:2306.08350, 2023
-
[33]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Mea- suring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[34]
Adversarial training for large neural language models
Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020
-
[35]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Prompt Injection attack against LLM-integrated Applications
Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt in- jection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Jailbreaking chat- gpt via prompt engineering: An empirical study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chat- gpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023
-
[38]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoy- anov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[39]
Kin-Keung Ma, Khoo Yit Phang, Jeffrey S Foster, and Michael Hicks. Directed symbolic execution. In Static Analysis: 18th International Symposium, SAS 2011, Venice, Italy, September 14-16, 2011. Proceed- ings 18, pages 95–111. Springer, 2011
work page 2011
-
[40]
A holistic approach to undesired content detection in the real world
Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloun- dou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023
work page 2023
-
[41]
No- table: Transferable backdoor attacks against prompt-based nlp models
Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. No- table: Transferable backdoor attacks against prompt-based nlp models. arXiv preprint arXiv:2305.17826, 2023. 17
-
[42]
An empirical study of the reliability of unix utilities
Barton P Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of unix utilities. Communications of the ACM, 33(12):32– 44, 1990
work page 1990
-
[43]
Evaluating the robustness of neural language models to input perturbations
Milad Moradi and Matthias Samwald. Evaluating the robustness of neural language models to input perturbations. arXiv preprint arXiv:2108.12237, 2021
- [44]
-
[45]
Accessed: 08/08/2023
work page 2023
-
[46]
OpenAI. Forecasting potential misuses of language models for disin- formation campaigns and how to reduce risk. https://openai.com/ research/forecasting-misuse, 2023. Accessed: 08/08/2023
work page 2023
-
[47]
Function calling and other api updates
OpenAI. Function calling and other api updates. https://openai. com/blog/function-calling-and-other-api-updates , 2023. Accessed: 08/08/2023
work page 2023
-
[48]
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wain- wright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
work page 2022
-
[50]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models.arXiv preprint arXiv:2308.01263, 2023
work page internal anchor Pith review arXiv 2023
-
[52]
Drifuzz: Har- vesting bugs in device drivers from golden seeds
Zekun Shen, Ritik Roongta, and Brendan Dolan-Gavitt. Drifuzz: Har- vesting bugs in device drivers from golden seeds. In 31st USENIX Security Symposium (USENIX Security 22), pages 1275–1290, 2022
work page 2022
-
[53]
Process for adapting language models to society (palms) with values-targeted datasets
Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021
work page 2021
-
[54]
Safety assessment of chinese large language models
Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023
-
[55]
Introducing mpt-7b: A new standard for open- source, commercially usable llms, 2023
MosaicML NLP Team. Introducing mpt-7b: A new standard for open- source, commercially usable llms, 2023. Accessed: 08/08/2023
work page 2023
-
[56]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Alma- hairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine- tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Erik Trickel, Fabio Pagani, Chang Zhu, Lukas Dresel, Giovanni Vigna, Christopher Kruegel, Ruoyu Wang, Tiffany Bao, Yan Shoshitaishvili, and Adam Doupé. Toss a fault to your witcher: Applying grey-box coverage-guided mutational fuzzing to detect sql and command injec- tion vulnerabilities. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2658–267...
work page 2023
-
[58]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[59]
syzkaller: unsuper- vised, coverage-guided kernel fuzzer.https://github.com/google/ syzkaller, 2023
Dmitry Vyukov and the syzkaller contributors. syzkaller: unsuper- vised, coverage-guided kernel fuzzer.https://github.com/google/ syzkaller, 2023. Accessed: 08/08/2023
work page 2023
-
[60]
Decodingtrust: A comprehensive assessment of trustworthiness in gpt models
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023
-
[61]
{SyzVegas}: Beating kernel fuzzing odds with reinforcement learning
Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. {SyzVegas}: Beating kernel fuzzing odds with reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21), pages 2741–2758, 2021
-
[62]
Is chatgpt a good nlg evaluator? a preliminary study
Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023
-
[63]
Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing
Jinghan Wang, Chengyu Song, and Heng Yin. Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing. 2021
-
[64]
Self-critique prompting with large language models for inductive instructions
Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733, 2023
-
[65]
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023
-
[66]
Singularity: Pattern fuzzing for worst case complexity
Jiayi Wei, Jia Chen, Yu Feng, Kostas Ferles, and Isil Dillig. Singularity: Pattern fuzzing for worst case complexity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 213–223, 2018
-
[67]
Challenges in detoxifying language models
Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021
-
[68]
One fuzzing strategy to rule them all
Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. One fuzzing strategy to rule them all. In Proceedings of the 44th International Conference on Software Engineering, pages 1634–1645, 2022
-
[69]
Cvalues: Measuring the values of chinese large language models from safety to responsibility
Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705, 2023
-
[70]
Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023
-
[71]
Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. EcoFuzz: Adaptive Energy-Saving greybox fuzzing as a variant of the adversarial Multi-Armed bandit. In 29th USENIX Security Symposium (USENIX Security 20), pages 2307–2324, 2020
-
[72]
Michał Zalewski. American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2023. Accessed: 08/08/2023
-
[73]
Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing
G Zhang, P Wang, T Yue, X Kong, S Huang, X Zhou, and K Lu. Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing. In Network and Distributed Systems Security (NDSS) Symposium, volume 2022, 2022
-
[74]
Fine-mixing: Mitigating backdoors in fine-tuned language models
Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating backdoors in fine-tuned language models. arXiv preprint arXiv:2210.09545, 2022
-
[75]
Evolutionary mutation-based fuzzing as monte carlo tree search
Yiru Zhao, Xiaoke Wang, Lei Zhao, Yueqiang Cheng, and Heng Yin. Evolutionary mutation-based fuzzing as monte carlo tree search. arXiv preprint arXiv:2101.00612, 2021
-
[76]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
-
[77]
Fine-Tuning Language Models from Human Preferences
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019
-
[78]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
A Datasets
A.1 Jailbreak Templates
As stated in Section 4.1, we sample 77 jailbreak templates from previous work [37], which are collected from online shared jailbreak templates.1 He...
and an earlier version released in March (gpt-3.5-turbo-0301), as detailed in Table 6. We can see that although both models can be jailbroken for all 100 questions, gpt-3.5-turbo-0613 demonstrates enhanced robustness compared to gpt-3.5-turbo-0301. The average number of successful templates on gpt-3.5-turbo-0301 is 51.55, which is much higher than the latest ver...