pith. machine review for the scientific record.

arxiv: 2309.10253 · v4 · submitted 2023-09-19 · 💻 cs.AI

Recognition: no theorem link

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords jailbreak attacks · large language models · fuzz testing · red teaming · automated prompt generation · LLM safety evaluation

The pith

Automated fuzzing of human-written jailbreak seeds produces templates that succeed against ChatGPT and Llama-2 at rates above 90 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GPTFuzz as a black-box framework that automates jailbreak template creation for red-teaming large language models. It begins with a small set of human-written seed templates, applies mutation operators to generate semantically similar variants, and uses a judgment model to decide which variants count as successful attacks. The central claim is that this process yields higher success rates than manual templates even when the starting seeds are suboptimal. A reader would care because current safety alignments in deployed models can be tested and potentially bypassed at scale without requiring expert prompt engineering for each test case.

Core claim

GPTFuzz starts with human-written jailbreak templates as seeds, selects them for mutation to balance efficiency and diversity, applies operators that rewrite sentences while preserving intent, and feeds the results to a judgment model that labels whether the target LLM produced the requested harmful content. On this basis the framework reports attack success rates exceeding 90 percent on ChatGPT and Llama-2 across multiple scenarios, outperforming the original hand-crafted seeds.

What carries the argument

The fuzzing loop that repeatedly selects seeds, applies semantic-preserving mutation operators, and filters results through a judgment model that scores whether harmful content was elicited.
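
A minimal sketch of that loop in Python. The helper names (`select_seed`, `mutate`, `query_target`, `judge`) are hypothetical stand-ins for the paper's components, and the UCB-style selection score is one plausible way to balance efficiency and diversity, not necessarily the paper's exact strategy.

```python
import math
import random

# Minimal sketch of a GPTFuzz-style black-box fuzzing loop.
# select_seed uses a UCB-style score; mutate, query_target, and judge are
# hypothetical callables standing in for the paper's mutation operators,
# target LLM API, and judgment model.

def select_seed(pool, total_trials, c=1.4):
    """Score each seed by past success (efficiency) plus an exploration
    bonus for rarely tried seeds (diversity), and pick the best."""
    def score(seed):
        if seed["trials"] == 0:
            return float("inf")
        exploit = seed["successes"] / seed["trials"]
        explore = c * math.sqrt(math.log(total_trials + 1) / seed["trials"])
        return exploit + explore
    return max(pool, key=score)

def fuzz(initial_templates, questions, mutate, query_target, judge, budget=500):
    pool = [{"template": t, "trials": 0, "successes": 0} for t in initial_templates]
    jailbreaks = []
    for step in range(1, budget + 1):
        seed = select_seed(pool, step)
        variant = mutate(seed["template"])          # semantics-preserving rewrite
        question = random.choice(questions)         # harmful question to embed
        response = query_target(variant, question)  # black-box call to the target LLM
        success = judge(question, response)         # did the target comply?
        seed["trials"] += 1
        if success:
            seed["successes"] += 1
            jailbreaks.append(variant)
            # successful variants re-enter the pool as new seeds
            pool.append({"template": variant, "trials": 1, "successes": 1})
    return jailbreaks
```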

If this is right

  • Red-teaming large language models can be scaled beyond the limits of manual prompt design.
  • Safety evaluations of commercial and open-source models can be automated and repeated with diverse prompt variations.
  • Even models that resist the original human-written templates remain vulnerable to small rephrasings produced by mutation.
  • Jailbreak success can be achieved without access to model weights or internal states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM safety training appears sensitive to prompt phrasing variations that preserve meaning, suggesting alignment may be shallower than intended.
  • The same mutation-based generation approach could be applied to discover other failure modes such as factual hallucination or bias amplification.
  • Developers of LLMs may need to incorporate automated red-teaming loops into their training or evaluation pipelines to close these gaps.
  • Over-reliance on any single judgment model risks inflating perceived vulnerability unless cross-validated against human raters.

Load-bearing premise

The separate judgment model correctly labels whether each generated prompt actually succeeded in eliciting harmful output from the target LLM.

What would settle it

Manual inspection of several hundred GPTFuzz outputs and target responses that shows the judgment model misclassifies at least 15 percent of cases as successes when the LLM actually refused or gave safe answers.
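
A sketch of how such an audit could be scored, assuming a manually inspected sample where each judge verdict is paired with a human label (both inputs hypothetical):

```python
def audit_judge(records):
    """records: list of (judge_says_success, human_says_success) pairs for a
    manually inspected sample of GPTFuzz prompts and target responses."""
    judged_success = [r for r in records if r[0]]
    false_positives = [r for r in judged_success if not r[1]]
    # Share of judge-declared successes that humans say were actually refusals
    # or safe answers; a value of 0.15 or higher would meet the criterion above.
    return len(false_positives) / max(len(judged_success), 1)

# Made-up labels purely to show the calculation.
sample = [(True, True), (True, False), (True, True), (False, False)]
print(f"false-positive share among judged successes: {audit_judge(sample):.2f}")
```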

read the original abstract

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents GPTFuzz, a black-box fuzzing framework inspired by AFL that automates jailbreak template generation for red-teaming LLMs. It starts with human-written seed templates, applies mutation operators to create variants, uses a seed selection strategy, and employs an LLM-based judgment model to evaluate attack success. Experiments on ChatGPT, LLaMA-2, and Vicuna report attack success rates exceeding 90% even from suboptimal seeds, outperforming manual templates.

Significance. If the results are robust, the work supplies a practical, scalable tool for automated red-teaming that could accelerate safety evaluations of LLMs and motivate stronger alignment techniques. The empirical demonstration across commercial and open-source models, combined with the mutation-based generation approach, offers a concrete advance over purely manual jailbreak crafting.

major comments (2)
  1. [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.
  2. [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.
minor comments (2)
  1. [Abstract] The abstract states that GPTFuzz surpasses human-crafted templates but does not quantify the human baseline ASR or describe the exact set of manual templates used for comparison.
  2. [Method] Notation for the mutation operators and the seed selection scoring function could be introduced more formally with pseudocode or equations to improve reproducibility.
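
To illustrate the kind of formalisation this comment asks for, here is a hypothetical sketch of mutation operators expressed as prompts to a mutator LLM; the operator names and wording are illustrative only, not the paper's exact operator set or prompts.

```python
# Hypothetical mutation operators for jailbreak templates, expressed as
# prompts to a mutator LLM. Illustrative only; not the paper's exact set.

MUTATION_PROMPTS = {
    "rephrase": "Rewrite the following jailbreak template in different words "
                "without changing its meaning:\n{template}",
    "expand": "Add one or two introductory sentences to the following template, "
              "keeping its intent intact:\n{template}",
    "shorten": "Condense the following template while preserving the "
               "instructions it gives the model:\n{template}",
    "crossover": "Merge the two templates below into one coherent template:\n"
                 "{template}\n---\n{other}",
}

def mutate(template, operator, call_mutator_llm, other=None):
    """call_mutator_llm is a hypothetical wrapper around any chat-completion API."""
    prompt = MUTATION_PROMPTS[operator].format(template=template, other=other)
    return call_mutator_llm(prompt)
```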

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.

    Authors: We agree that the accuracy and potential bias of the LLM-based judgment model are critical to the validity of the reported attack success rates. The original manuscript described the judgment model but did not include calibration details. In the revised version, we will add a dedicated subsection that reports a calibration study: a random sample of responses will be independently labeled by multiple human annotators, inter-annotator agreement (Cohen's kappa) will be computed, and false-positive and false-negative rates of the judge relative to human consensus will be presented. This will directly address concerns about systematic bias. revision: yes

  2. Referee: [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.

    Authors: We acknowledge the absence of these controls and comparisons. In the revision we will add (1) direct comparisons against other automated jailbreak generation methods that were available at the time of submission, (2) results from multiple independent runs of GPTFuzz at different temperature settings, reporting means and standard deviations, and (3) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing GPTFuzz against both human-crafted templates and the added baselines. These additions will demonstrate robustness beyond single-run results. revision: yes
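
A sketch of the two proposed analyses on hypothetical data: inter-annotator agreement and judge error rates via scikit-learn's cohen_kappa_score, and a paired comparison of per-scenario attack success rates via SciPy's Wilcoxon signed-rank test. All numbers below are invented placeholders.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical calibration sample: the judge's verdicts on sampled responses,
# plus labels from two independent human annotators.
judge = np.array([1, 1, 0, 1, 0, 1, 1, 0])
annotator_a = np.array([1, 0, 0, 1, 0, 1, 1, 0])
annotator_b = np.array([1, 0, 0, 1, 0, 1, 0, 0])
human = ((annotator_a + annotator_b) == 2).astype(int)  # both must call it a success

kappa = cohen_kappa_score(annotator_a, annotator_b)
fp = np.mean(judge[human == 0] == 1)  # judge says success where humans do not
fn = np.mean(judge[human == 1] == 0)  # judge misses a human-labeled success
print(f"inter-annotator kappa={kappa:.2f}, judge FP rate={fp:.2f}, FN rate={fn:.2f}")

# Hypothetical per-scenario attack success rates, paired across the same
# scenarios, for GPTFuzz versus the human-crafted seed templates.
asr_gptfuzz = np.array([0.92, 0.95, 0.88, 0.97, 0.91])
asr_human = np.array([0.60, 0.72, 0.55, 0.80, 0.66])
stat, p = wilcoxon(asr_gptfuzz, asr_human)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")
```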

Circularity Check

0 steps flagged

No significant circularity in empirical framework

full rationale

The paper presents GPTFuzz as an empirical black-box fuzzing framework: human-written seed templates are mutated via operators and evaluated for jailbreak success using a separate judgment model against target LLMs (ChatGPT, LLaMa-2, Vicuna). No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Results (e.g., >90% ASR) are reported from direct evaluations on external models rather than reducing to inputs by construction. The framework relies on external APIs and human seeds, with no load-bearing self-citations or uniqueness claims that collapse the central result. This is a standard empirical methodology without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard fuzzing concepts and LLM capabilities without introducing new free parameters, axioms, or invented entities beyond the described components.

pith-pipeline@v0.9.0 · 5607 in / 955 out tokens · 39063 ms · 2026-05-15T06:20:52.596251+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  2. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  3. On the Hardness of Junking LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

  4. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  5. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.

  6. Adaptive Instruction Composition for Automated LLM Red-Teaming

    cs.CR 2026-04 unverdicted novelty 7.0

    Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.

  7. LLM-Agnostic Semantic Representation Attack

    cs.CL 2026-05 unverdicted novelty 6.0

    SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.

  8. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  9. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 6.0

    Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.

  10. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  11. HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

    cs.CL 2026-04 unverdicted novelty 6.0

    HarDBench demonstrates that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and a safety-utility balanced alignment via preference optimization signi...

  12. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  13. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  14. Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.

  15. Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    ESI framework identifies architecture-specific safety-critical parameters in LLMs, enabling SET to reduce attack success rates by over 50% via 1% weight updates and SPA to limit safety loss to under 1% during instruct...

  16. Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

    cs.AI 2026-04 unverdicted novelty 6.0

    CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.

  17. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  18. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

    cs.CR 2026-04 unverdicted novelty 6.0

    TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

  19. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  20. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  21. PIArena: A Platform for Prompt Injection Evaluation

    cs.CR 2026-04 unverdicted novelty 5.0

    PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.

  22. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 20 Pith papers · 17 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  2. [2]

    Introducing claude

    Anthropic. Introducing claude. https://www.anthropic.com/index/introducing-claude. Accessed on 08/08/2023

  3. [3]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002

  4. [4]

    Efficient greybox fuzzing to detect memory errors

    Jinsheng Ba, Gregory J Duck, and Abhik Roychoudhury. Efficient greybox fuzzing to detect memory errors. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022

  5. [5]

    Spinning language models: Risks of propaganda-as-a-service and countermeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning language models: Risks of propaganda-as-a-service and countermeasures. In Proc. of IEEE Symposium on Security and Privacy (SP). IEEE, 2022

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Baichuan-13b

    Baichuan-Inc. Baichuan-13b. https://github.com/baichuan-inc/Baichuan-13B. Accessed on 08/08/2023

  8. [8]

    A computer wrote this paper: What chatgpt means for education, research, and writing

    Lea Bishop. A computer wrote this paper: What chatgpt means for education, research, and writing. Research, and Writing (January 26, 2023), 2023

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  10. [10]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  11. [11]

    How is chatgpt’s behavior changing over time?

    Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009, 2023

  12. [12]

    Chatgpt and other artificial intelligence applications speed up scientific writing

    Tzeng-Ji Chen. Chatgpt and other artificial intelligence applications speed up scientific writing. Journal of the Chinese Medical Association, 86(4):351–353, 2023

  13. [13]

    Efficient selectivity and backup operators in monte-carlo tree search

    Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. Computers and games, 4630:72–83, 2006

  14. [14]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023

  15. [15]

    Toxicity in chatgpt: Analyzing persona-assigned language models

    Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023

  16. [16]

    Dynamic symbolic execution guided by data dependency analysis for high structural coverage

    TheAnh Do, Alvis Cheuk M Fong, and Russel Pears. Dynamic symbolic execution guided by data dependency analysis for high structural coverage. In Evaluation of Novel Approaches to Software Engineering: 7th International Conference, ENASE 2012, Warsaw, Poland, June 29-30, 2012, Revised Selected Papers 7, pages 3–15. Springer, 2013

  17. [17]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021

  18. [18]

    Snipuzz: Black-box fuzzing of iot firmware via message snippet inference

    Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. Snipuzz: Black-box fuzzing of iot firmware via message snippet inference. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 337–350, 2021

  19. [19]

    Libafl: A framework to build modular and reusable fuzzers

    Andrea Fioraldi, Dominik Christian Maier, Dongjia Zhang, and Davide Balzarotti. Libafl: A framework to build modular and reusable fuzzers. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 1051–1065, 2022

  20. [20]

    Dyta: dynamic symbolic execution guided with static verification results

    Xi Ge, Kunal Taneja, Tao Xie, and Nikolai Tillmann. Dyta: dynamic symbolic execution guided with static verification results. In Proceedings of the 33rd International Conference on Software Engineering, pages 992–994, 2011

  21. [21]

    Realtoxicityprompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020

  22. [22]

    Bard

    Google. Bard. https://bard.google.com/. Accessed on 08/08/2023

  23. [23]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023

  24. [24]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017

  25. [25]

    Seed selection for successful fuzzing

    Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L Hosking. Seed selection for successful fuzzing. In Proceedings of the 30th ACM SIGSOFT international symposium on software testing and analysis, pages 230–243, 2021

  26. [26]

    Balance seed scheduling via monte carlo planning

    Heqing Huang, Hung-Chun Chiu, Qingkai Shi, Peisen Yao, and Charles Zhang. Balance seed scheduling via monte carlo planning. IEEE Transactions on Dependable and Secure Computing, 2023

  27. [27]

    Diar: Removing uninteresting bytes from seeds in software fuzzing

    Aftab Hussain and Mohammad Amin Alipour. Diar: Removing uninteresting bytes from seeds in software fuzzing. arXiv preprint arXiv:2112.13297, 2021

  28. [28]

    Smart seed selection-based effective black box fuzzing for iiot protocol

    SungJin Kim, Jaeik Cho, Changhoon Lee, and Taeshik Shon. Smart seed selection-based effective black box fuzzing for iiot protocol. The Journal of Supercomputing, 76:10140–10154, 2020

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  30. [30]

    Learning seed-adaptive mutation strategies for greybox fuzzing

    Myungho Lee, Sooyoung Cha, and Hakjoo Oh. Learning seed-adaptive mutation strategies for greybox fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 384–

  31. [31]

    Multi-step jailbreaking privacy attacks on chatgpt

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023

  32. [32]

    Multi-target backdoor attacks for code pre-trained models

    Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. Multi-target backdoor attacks for code pre-trained models. arXiv preprint arXiv:2306.08350, 2023

  33. [33]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  34. [34]

    Adversarial training for large neural language models

    Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020

  35. [35]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

  36. [36]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023

  37. [37]

    Jailbreaking chatgpt via prompt engineering: An empirical study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023

  38. [38]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  39. [39]

    Directed symbolic execution

    Kin-Keung Ma, Khoo Yit Phang, Jeffrey S Foster, and Michael Hicks. Directed symbolic execution. In Static Analysis: 18th International Symposium, SAS 2011, Venice, Italy, September 14-16, 2011. Proceedings 18, pages 95–111. Springer, 2011

  40. [40]

    A holistic approach to undesired content detection in the real world

    Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

  41. [41]

    Notable: Transferable backdoor attacks against prompt-based nlp models

    Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. Notable: Transferable backdoor attacks against prompt-based nlp models. arXiv preprint arXiv:2305.17826, 2023

  42. [42]

    An empirical study of the reliability of unix utilities

    Barton P Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of unix utilities. Communications of the ACM, 33(12):32–44, 1990

  43. [43]

    Evaluating the robustness of neural language models to input perturbations

    Milad Moradi and Matthias Samwald. Evaluating the robustness of neural language models to input perturbations. arXiv preprint arXiv:2108.12237, 2021

  44. [44]

    Introducing chatgpt

    OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt. Accessed: 08/08/2023

  45. [45]

  46. [46]

    Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk

    OpenAI. Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk. https://openai.com/research/forecasting-misuse, 2023. Accessed: 08/08/2023

  47. [47]

    Function calling and other api updates

    OpenAI. Function calling and other api updates. https://openai.com/blog/function-calling-and-other-api-updates, 2023. Accessed: 08/08/2023

  48. [48]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

  49. [49]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  50. [50]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

  51. [51]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  52. [52]

    Drifuzz: Harvesting bugs in device drivers from golden seeds

    Zekun Shen, Ritik Roongta, and Brendan Dolan-Gavitt. Drifuzz: Harvesting bugs in device drivers from golden seeds. In 31st USENIX Security Symposium (USENIX Security 22), pages 1275–1290, 2022

  53. [53]

    Process for adapting language models to society (palms) with values-targeted datasets

    Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021

  54. [54]

    Safety assessment of chinese large language models

    Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023

  55. [55]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 08/08/2023

  56. [56]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  57. [57]

    Toss a fault to your witcher: Applying grey-box coverage-guided mutational fuzzing to detect sql and command injection vulnerabilities

    Erik Trickel, Fabio Pagani, Chang Zhu, Lukas Dresel, Giovanni Vigna, Christopher Kruegel, Ruoyu Wang, Tiffany Bao, Yan Shoshitaishvili, and Adam Doupé. Toss a fault to your witcher: Applying grey-box coverage-guided mutational fuzzing to detect sql and command injection vulnerabilities. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2658–267...

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  59. [59]

    syzkaller: unsupervised, coverage-guided kernel fuzzer

    Dmitry Vyukov and the syzkaller contributors. syzkaller: unsupervised, coverage-guided kernel fuzzer. https://github.com/google/syzkaller, 2023. Accessed: 08/08/2023

  60. [60]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023

  61. [61]

    {SyzVegas}: Beating kernel fuzzing odds with reinforcement learning

    Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. {SyzVegas}: Beating kernel fuzzing odds with reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21), pages 2741–2758, 2021

  62. [62]

    Is chatgpt a good nlg evaluator? a preliminary study

    Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023

  63. [63]

    Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing

    Jinghan Wang, Chengyu Song, and Heng Yin. Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing. 2021

  64. [64]

    Self-critique prompting with large language models for inductive instructions

    Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733, 2023

  65. [65]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  66. [66]

    Singularity: Pattern fuzzing for worst case complexity

    Jiayi Wei, Jia Chen, Yu Feng, Kostas Ferles, and Isil Dillig. Singularity: Pattern fuzzing for worst case complexity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 213–223, 2018

  67. [67]

    Challenges in detoxifying language models

    Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021

  68. [68]

    One fuzzing strategy to rule them all

    Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. One fuzzing strategy to rule them all. In Proceedings of the 44th International Conference on Software Engineering, pages 1634–1645, 2022

  69. [69]

    Cvalues: Measuring the values of chinese large language models from safety to responsibility

    Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705, 2023

  70. [70]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023

  71. [71]

    {EcoFuzz}: Adaptive {Energy-Saving} greybox fuzzing as a variant of the adversarial {Multi-Armed} bandit

    Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. {EcoFuzz}: Adaptive {Energy-Saving} greybox fuzzing as a variant of the adversarial {Multi-Armed} bandit. In 29th USENIX Security Symposium (USENIX Security 20), pages 2307–2324, 2020

  72. [72]

    American fuzzy lop

    Michał Zalewski. American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2023. Accessed: 08/08/2023

  73. [73]

    Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing

    G Zhang, P Wang, T Yue, X Kong, S Huang, X Zhou, and K Lu. Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing. In Network and Distributed Systems Security (NDSS) Symposium, volume 2022, 2022

  74. [74]

    Fine-mixing: Mitigating backdoors in fine-tuned language models

    Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating backdoors in fine-tuned language models. arXiv preprint arXiv:2210.09545, 2022

  75. [75]

    Evolutionary mutation-based fuzzing as monte carlo tree search

    Yiru Zhao, Xiaoke Wang, Lei Zhao, Yueqiang Cheng, and Heng Yin. Evolutionary mutation-based fuzzing as monte carlo tree search. arXiv preprint arXiv:2101.00612, 2021

  76. [76]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  77. [77]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  78. [78]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023
