pith. machine review for the scientific record.

arxiv: 2309.10253 · v4 · submitted 2023-09-19 · 💻 cs.AI

Recognition: no theorem link

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords jailbreak attacks · large language models · fuzz testing · red teaming · automated prompt generation · LLM safety evaluation

The pith

Automated fuzzing of human-written jailbreak seeds produces templates that succeed against ChatGPT and Llama-2 at rates above 90 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GPTFuzz as a black-box framework that automates jailbreak template creation for red-teaming large language models. It begins with a small set of human-written seed templates, applies mutation operators to generate semantically similar variants, and uses a judgment model to decide which variants count as successful attacks. The central claim is that this process yields higher success rates than manual templates even when the starting seeds are suboptimal. A reader would care because current safety alignments in deployed models can be tested and potentially bypassed at scale without requiring expert prompt engineering for each test case.

Core claim

GPTFuzz starts with human-written jailbreak templates as seeds, selects them for mutation to balance efficiency and diversity, applies operators that rewrite sentences while preserving intent, and feeds the results to a judgment model that labels whether the target LLM produced the requested harmful content. On this basis the framework reports attack success rates exceeding 90 percent on ChatGPT and Llama-2 across multiple scenarios, outperforming the original hand-crafted seeds.

What carries the argument

The fuzzing loop that repeatedly selects seeds, applies semantic-preserving mutation operators, and filters results through a judgment model that scores whether harmful content was elicited.
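
A minimal sketch of that loop in Python. The helper names (`select_seed`, `mutate`, `query_target`, `judge`) are hypothetical stand-ins for the paper's components, and the UCB-style selection score is one plausible way to balance efficiency and diversity, not necessarily the paper's exact strategy.

```python
import math
import random

# Minimal sketch of a GPTFuzz-style black-box fuzzing loop.
# select_seed uses a UCB-style score; mutate, query_target, and judge are
# hypothetical callables standing in for the paper's mutation operators,
# target LLM API, and judgment model.

def select_seed(pool, total_trials, c=1.4):
    """Score each seed by past success (efficiency) plus an exploration
    bonus for rarely tried seeds (diversity), and pick the best."""
    def score(seed):
        if seed["trials"] == 0:
            return float("inf")
        exploit = seed["successes"] / seed["trials"]
        explore = c * math.sqrt(math.log(total_trials + 1) / seed["trials"])
        return exploit + explore
    return max(pool, key=score)

def fuzz(initial_templates, questions, mutate, query_target, judge, budget=500):
    pool = [{"template": t, "trials": 0, "successes": 0} for t in initial_templates]
    jailbreaks = []
    for step in range(1, budget + 1):
        seed = select_seed(pool, step)
        variant = mutate(seed["template"])          # semantics-preserving rewrite
        question = random.choice(questions)         # harmful question to embed
        response = query_target(variant, question)  # black-box call to the target LLM
        success = judge(question, response)         # did the target comply?
        seed["trials"] += 1
        if success:
            seed["successes"] += 1
            jailbreaks.append(variant)
            # successful variants re-enter the pool as new seeds
            pool.append({"template": variant, "trials": 1, "successes": 1})
    return jailbreaks
```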

If this is right

  • Red-teaming large language models can be scaled beyond the limits of manual prompt design.
  • Safety evaluations of commercial and open-source models can be automated and repeated with diverse prompt variations.
  • Even models that resist the original human-written templates remain vulnerable to small rephrasings produced by mutation.
  • Jailbreak success can be achieved without access to model weights or internal states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM safety training appears sensitive to prompt phrasing variations that preserve meaning, suggesting alignment may be shallower than intended.
  • The same mutation-based generation approach could be applied to discover other failure modes such as factual hallucination or bias amplification.
  • Developers of LLMs may need to incorporate automated red-teaming loops into their training or evaluation pipelines to close these gaps.
  • Over-reliance on any single judgment model risks inflating perceived vulnerability unless cross-validated against human raters.

Load-bearing premise

The separate judgment model correctly labels whether each generated prompt actually succeeded in eliciting harmful output from the target LLM.

What would settle it

Manual inspection of several hundred GPTFuzz outputs and target responses that shows the judgment model misclassifies at least 15 percent of cases as successes when the LLM actually refused or gave safe answers.
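
A sketch of how such an audit could be scored, assuming a manually inspected sample where each judge verdict is paired with a human label (both inputs hypothetical):

```python
def audit_judge(records):
    """records: list of (judge_says_success, human_says_success) pairs for a
    manually inspected sample of GPTFuzz prompts and target responses."""
    judged_success = [r for r in records if r[0]]
    false_positives = [r for r in judged_success if not r[1]]
    # Share of judge-declared successes that humans say were actually refusals
    # or safe answers; a value of 0.15 or higher would meet the criterion above.
    return len(false_positives) / max(len(judged_success), 1)

# Made-up labels purely to show the calculation.
sample = [(True, True), (True, False), (True, True), (False, False)]
print(f"false-positive share among judged successes: {audit_judge(sample):.2f}")
```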

read the original abstract

Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents GPTFuzz, a black-box fuzzing framework inspired by AFL that automates jailbreak template generation for red-teaming LLMs. It starts with human-written seed templates, applies mutation operators to create variants, uses a seed selection strategy, and employs an LLM-based judgment model to evaluate attack success. Experiments on ChatGPT, LLaMA-2, and Vicuna report attack success rates exceeding 90% even from suboptimal seeds, outperforming manual templates.

Significance. If the results are robust, the work supplies a practical, scalable tool for automated red-teaming that could accelerate safety evaluations of LLMs and motivate stronger alignment techniques. The empirical demonstration across commercial and open-source models, combined with the mutation-based generation approach, offers a concrete advance over purely manual jailbreak crafting.

major comments (2)
  1. [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.
  2. [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.
minor comments (2)
  1. [Abstract] The abstract states that GPTFuzz surpasses human-crafted templates but does not quantify the human baseline ASR or describe the exact set of manual templates used for comparison.
  2. [Method] Notation for the mutation operators and the seed selection scoring function could be introduced more formally with pseudocode or equations to improve reproducibility.
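
To illustrate the kind of formalisation this comment asks for, here is a hypothetical sketch of mutation operators expressed as prompts to a mutator LLM; the operator names and wording are illustrative only, not the paper's exact operator set or prompts.

```python
# Hypothetical mutation operators for jailbreak templates, expressed as
# prompts to a mutator LLM. Illustrative only; not the paper's exact set.

MUTATION_PROMPTS = {
    "rephrase": "Rewrite the following jailbreak template in different words "
                "without changing its meaning:\n{template}",
    "expand": "Add one or two introductory sentences to the following template, "
              "keeping its intent intact:\n{template}",
    "shorten": "Condense the following template while preserving the "
               "instructions it gives the model:\n{template}",
    "crossover": "Merge the two templates below into one coherent template:\n"
                 "{template}\n---\n{other}",
}

def mutate(template, operator, call_mutator_llm, other=None):
    """call_mutator_llm is a hypothetical wrapper around any chat-completion API."""
    prompt = MUTATION_PROMPTS[operator].format(template=template, other=other)
    return call_mutator_llm(prompt)
```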

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation and judgment model description] The central >90% ASR claim (abstract and evaluation results) depends on the judgment model correctly classifying jailbreak success, yet no calibration against human labels, false-positive/negative rates, or inter-annotator agreement statistics are reported. This is load-bearing because any systematic bias in the judge directly inflates the headline numbers.

    Authors: We agree that the accuracy and potential bias of the LLM-based judgment model are critical to the validity of the reported attack success rates. The original manuscript described the judgment model but did not include calibration details. In the revised version, we will add a dedicated subsection that reports a calibration study: a random sample of responses will be independently labeled by multiple human annotators, inter-annotator agreement (Cohen's kappa) will be computed, and false-positive and false-negative rates of the judge relative to human consensus will be presented. This will directly address concerns about systematic bias. revision: yes

  2. Referee: [Experiments and results] The experimental section provides no baseline comparisons against other automated jailbreak generation methods, no statistical significance tests on the reported success rates, and insufficient controls for response variability across multiple runs or temperature settings.

    Authors: We acknowledge the absence of these controls and comparisons. In the revision we will add (1) direct comparisons against other automated jailbreak generation methods that were available at the time of submission, (2) results from multiple independent runs of GPTFuzz at different temperature settings, reporting means and standard deviations, and (3) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) comparing GPTFuzz against both human-crafted templates and the added baselines. These additions will demonstrate robustness beyond single-run results. revision: yes
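
A sketch of the two proposed analyses on hypothetical data: inter-annotator agreement and judge error rates via scikit-learn's cohen_kappa_score, and a paired comparison of per-scenario attack success rates via SciPy's Wilcoxon signed-rank test. All numbers below are invented placeholders.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical calibration sample: the judge's verdicts on sampled responses,
# plus labels from two independent human annotators.
judge = np.array([1, 1, 0, 1, 0, 1, 1, 0])
annotator_a = np.array([1, 0, 0, 1, 0, 1, 1, 0])
annotator_b = np.array([1, 0, 0, 1, 0, 1, 0, 0])
human = ((annotator_a + annotator_b) == 2).astype(int)  # both must call it a success

kappa = cohen_kappa_score(annotator_a, annotator_b)
fp = np.mean(judge[human == 0] == 1)  # judge says success where humans do not
fn = np.mean(judge[human == 1] == 0)  # judge misses a human-labeled success
print(f"inter-annotator kappa={kappa:.2f}, judge FP rate={fp:.2f}, FN rate={fn:.2f}")

# Hypothetical per-scenario attack success rates, paired across the same
# scenarios, for GPTFuzz versus the human-crafted seed templates.
asr_gptfuzz = np.array([0.92, 0.95, 0.88, 0.97, 0.91])
asr_human = np.array([0.60, 0.72, 0.55, 0.80, 0.66])
stat, p = wilcoxon(asr_gptfuzz, asr_human)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")
```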

Circularity Check

0 steps flagged

No significant circularity in empirical framework

full rationale

The paper presents GPTFuzz as an empirical black-box fuzzing framework: human-written seed templates are mutated via operators and evaluated for jailbreak success using a separate judgment model against target LLMs (ChatGPT, LLaMa-2, Vicuna). No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Results (e.g., >90% ASR) are reported from direct evaluations on external models rather than reducing to inputs by construction. The framework relies on external APIs and human seeds, with no load-bearing self-citations or uniqueness claims that collapse the central result. This is a standard empirical methodology without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework relies on standard fuzzing concepts and LLM capabilities without introducing new free parameters, axioms, or invented entities beyond the described components.

pith-pipeline@v0.9.0 · 5607 in / 955 out tokens · 39063 ms · 2026-05-15T06:20:52.596251+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  2. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 7.0

    Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

  3. On the Hardness of Junking LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.

  4. ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

    cs.CL 2026-05 unverdicted novelty 7.0

    ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

  5. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.

  6. Adaptive Instruction Composition for Automated LLM Red-Teaming

    cs.CR 2026-04 unverdicted novelty 7.0

    Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on Harmbench.

  7. LLM-Agnostic Semantic Representation Attack

    cs.CL 2026-05 unverdicted novelty 6.0

    SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.

  8. PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

    cs.HC 2026-05 unverdicted novelty 6.0

    PersonaTeaming Workflow improves automated red-teaming attack success rates over RainbowPlus using personas while maintaining diversity, and PersonaTeaming Playground supports human-AI collaboration in red-teaming as ...

  9. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 6.0

    Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.

  10. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  11. HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

    cs.CL 2026-04 unverdicted novelty 6.0

    HarDBench demonstrates that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and a safety-utility balanced alignment via preference optimization signi...

  12. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  13. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

    cs.CR 2026-04 unverdicted novelty 6.0

    Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

  14. Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.

  15. Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    ESI framework identifies architecture-specific safety-critical parameters in LLMs, enabling SET to reduce attack success rates by over 50% via 1% weight updates and SPA to limit safety loss to under 1% during instruct...

  16. Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

    cs.AI 2026-04 unverdicted novelty 6.0

    CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.

  17. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  18. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

    cs.CR 2026-04 unverdicted novelty 6.0

    TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

  19. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    cs.CR 2024-03 accept novelty 6.0

    JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...

  20. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  21. PIArena: A Platform for Prompt Injection Evaluation

    cs.CR 2026-04 unverdicted novelty 5.0

    PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.

  22. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 20 Pith papers · 17 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  2. [2]

    Introducing claude

    Anthropic. Introducing claude. https://www.anthropic.com/index/introducing-claude. Accessed on 08/08/2023

  3. [3]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002

  4. [4]

    Efficient greybox fuzzing to detect memory errors

    Jinsheng Ba, Gregory J Duck, and Abhik Roychoudhury. Efficient greybox fuzzing to detect memory errors. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022

  5. [5]

    Spinning language models: Risks of propaganda-as-a-service and countermeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning language models: Risks of propaganda-as-a-service and countermeasures. In Proc. of IEEE Symposium on Security and Privacy (SP). IEEE, 2022

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Baichuan-13b

    Baichuan-Inc. Baichuan-13b. https://github.com/baichuan-inc/Baichuan-13B. Accessed on 08/08/2023

  8. [8]

    A computer wrote this paper: What chatgpt means for education, research, and writing

    Lea Bishop. A computer wrote this paper: What chatgpt means for education, research, and writing. Research, and Writing (January 26, 2023), 2023

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  10. [10]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  11. [11]

    How is chatgpt’s behavior changing over time?

    Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time? arXiv preprint arXiv:2307.09009, 2023

  12. [12]

    Chatgpt and other artificial intelligence applications speed up scientific writing

    Tzeng-Ji Chen. Chatgpt and other artificial intelligence applications speed up scientific writing. Journal of the Chinese Medical Association, 86(4):351–353, 2023

  13. [13]

    Efficient selectivity and backup operators in monte-carlo tree search

    Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. Computers and games, 4630:72–83, 2006

  14. [14]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023

  15. [15]

    Toxicity in chatgpt: Analyzing persona-assigned language models

    Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023

  16. [16]

    Dynamic symbolic execution guided by data dependency analysis for high structural coverage

    TheAnh Do, Alvis Cheuk M Fong, and Russel Pears. Dynamic symbolic execution guided by data dependency analysis for high structural coverage. In Evaluation of Novel Approaches to Software Engineering: 7th International Conference, ENASE 2012, Warsaw, Poland, June 29-30, 2012, Revised Selected Papers 7, pages 3–15. Springer, 2013

  17. [17]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021

  18. [18]

    Snipuzz: Black-box fuzzing of iot firmware via message snippet inference

    Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. Snipuzz: Black-box fuzzing of iot firmware via message snippet inference. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 337–350, 2021

  19. [19]

    Libafl: A framework to build modular and reusable fuzzers

    Andrea Fioraldi, Dominik Christian Maier, Dongjia Zhang, and Davide Balzarotti. Libafl: A framework to build modular and reusable fuzzers. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 1051–1065, 2022

  20. [20]

    Dyta: dynamic symbolic execution guided with static verification results

    Xi Ge, Kunal Taneja, Tao Xie, and Nikolai Tillmann. Dyta: dynamic symbolic execution guided with static verification results. In Proceedings of the 33rd International Conference on Software Engineering, pages 992–994, 2011

  21. [21]

    Realtoxicityprompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020

  22. [22]

    Bard

    Google. Bard. https://bard.google.com/. Accessed on 08/08/2023

  23. [23]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. arXiv preprint arXiv:2302.12173, 2023

  24. [24]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017

  25. [25]

    Seed selection for successful fuzzing

    Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L Hosking. Seed selection for successful fuzzing. In Proceedings of the 30th ACM SIGSOFT international symposium on software testing and analysis, pages 230–243, 2021

  26. [26]

    Balance seed scheduling via monte carlo planning

    Heqing Huang, Hung-Chun Chiu, Qingkai Shi, Peisen Yao, and Charles Zhang. Balance seed scheduling via monte carlo planning. IEEE Transactions on Dependable and Secure Computing, 2023

  27. [27]

    Diar: Removing uninteresting bytes from seeds in software fuzzing

    Aftab Hussain and Mohammad Amin Alipour. Diar: Removing uninteresting bytes from seeds in software fuzzing. arXiv preprint arXiv:2112.13297, 2021

  28. [28]

    Smart seed selection-based effective black box fuzzing for iiot protocol

    SungJin Kim, Jaeik Cho, Changhoon Lee, and Taeshik Shon. Smart seed selection-based effective black box fuzzing for iiot protocol. The Journal of Supercomputing, 76:10140–10154, 2020

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  30. [30]

    Learning seed-adaptive mutation strategies for greybox fuzzing

    Myungho Lee, Sooyoung Cha, and Hakjoo Oh. Learning seed-adaptive mutation strategies for greybox fuzzing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 384–

  31. [31]

    Multi-step jailbreaking privacy attacks on chatgpt

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023

  32. [32]

    Multi-target backdoor attacks for code pre-trained models

    Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. Multi-target backdoor attacks for code pre-trained models. arXiv preprint arXiv:2306.08350, 2023

  33. [33]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  34. [34]

    Adversarial training for large neural language models

    Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020

  35. [35]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

  36. [36]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023

  37. [37]

    Jailbreaking chatgpt via prompt engineering: An empirical study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023

  38. [38]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  39. [39]

    Directed symbolic execution

    Kin-Keung Ma, Khoo Yit Phang, Jeffrey S Foster, and Michael Hicks. Directed symbolic execution. In Static Analysis: 18th International Symposium, SAS 2011, Venice, Italy, September 14-16, 2011. Proceedings 18, pages 95–111. Springer, 2011

  40. [40]

    A holistic approach to undesired content detection in the real world

    Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018, 2023

  41. [41]

    Notable: Transferable backdoor attacks against prompt-based nlp models

    Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. Notable: Transferable backdoor attacks against prompt-based nlp models. arXiv preprint arXiv:2305.17826, 2023

  42. [42]

    An empirical study of the reliability of unix utilities

    Barton P Miller, Lars Fredriksen, and Bryan So. An empirical study of the reliability of unix utilities. Communications of the ACM, 33(12):32–44, 1990

  43. [43]

    Evaluating the robustness of neural language models to input perturbations

    Milad Moradi and Matthias Samwald. Evaluating the robustness of neural language models to input perturbations. arXiv preprint arXiv:2108.12237, 2021

  44. [44]

    Introducing chatgpt

    OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt. Accessed: 08/08/2023

  45. [45]

  46. [46]

    Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk

    OpenAI. Forecasting potential misuses of language models for disinformation campaigns and how to reduce risk. https://openai.com/research/forecasting-misuse, 2023. Accessed: 08/08/2023

  47. [47]

    Function calling and other api updates

    OpenAI. Function calling and other api updates. https://openai.com/blog/function-calling-and-other-api-updates, 2023. Accessed: 08/08/2023

  48. [48]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

  49. [49]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  50. [50]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

  51. [51]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  52. [52]

    Drifuzz: Harvesting bugs in device drivers from golden seeds

    Zekun Shen, Ritik Roongta, and Brendan Dolan-Gavitt. Drifuzz: Harvesting bugs in device drivers from golden seeds. In 31st USENIX Security Symposium (USENIX Security 22), pages 1275–1290, 2022

  53. [53]

    Process for adapting language models to society (palms) with values-targeted datasets

    Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021

  54. [54]

    Safety assessment of chinese large language models

    Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023

  55. [55]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 08/08/2023

  56. [56]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  57. [57]

    Toss a fault to your witcher: Applying grey-box coverage-guided mutational fuzzing to detect sql and command injection vulnerabilities

    Erik Trickel, Fabio Pagani, Chang Zhu, Lukas Dresel, Giovanni Vigna, Christopher Kruegel, Ruoyu Wang, Tiffany Bao, Yan Shoshitaishvili, and Adam Doupé. Toss a fault to your witcher: Applying grey-box coverage-guided mutational fuzzing to detect sql and command injection vulnerabilities. In 2023 IEEE Symposium on Security and Privacy (SP), pages 2658–267...

  58. [58]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  59. [59]

    syzkaller: unsupervised, coverage-guided kernel fuzzer

    Dmitry Vyukov and the syzkaller contributors. syzkaller: unsupervised, coverage-guided kernel fuzzer. https://github.com/google/syzkaller, 2023. Accessed: 08/08/2023

  60. [60]

    Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

    Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023

  61. [61]

    {SyzVegas}: Beating kernel fuzzing odds with reinforcement learning

    Daimeng Wang, Zheng Zhang, Hang Zhang, Zhiyun Qian, Srikanth V Krishnamurthy, and Nael Abu-Ghazaleh. {SyzVegas}: Beating kernel fuzzing odds with reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21), pages 2741–2758, 2021

  62. [62]

    Is chatgpt a good nlg evaluator? a preliminary study

    Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023

  63. [63]

    Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing

    Jinghan Wang, Chengyu Song, and Heng Yin. Reinforcement learning-based hierarchical seed scheduling for greybox fuzzing. 2021

  64. [64]

    Self-critique prompting with large language models for inductive instructions

    Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, and Kam-Fai Wong. Self-critique prompting with large language models for inductive instructions. arXiv preprint arXiv:2305.13733, 2023

  65. [65]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

  66. [66]

    Singularity: Pattern fuzzing for worst case complexity

    Jiayi Wei, Jia Chen, Yu Feng, Kostas Ferles, and Isil Dillig. Singularity: Pattern fuzzing for worst case complexity. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 213–223, 2018

  67. [67]

    Challenges in detoxifying language models

    Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445, 2021

  68. [68]

    One fuzzing strategy to rule them all

    Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. One fuzzing strategy to rule them all. In Proceedings of the 44th International Conference on Software Engineering, pages 1634–1645, 2022

  69. [69]

    Cvalues: Measuring the values of chinese large language models from safety to responsibility

    Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. Cvalues: Measuring the values of chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705, 2023

  70. [70]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher, 2023

  71. [71]

    {EcoFuzz}: Adaptive {Energy-Saving} greybox fuzzing as a variant of the adversarial {Multi-Armed} bandit

    Tai Yue, Pengfei Wang, Yong Tang, Enze Wang, Bo Yu, Kai Lu, and Xu Zhou. {EcoFuzz}: Adaptive {Energy-Saving} greybox fuzzing as a variant of the adversarial {Multi-Armed} bandit. In 29th USENIX Security Symposium (USENIX Security 20), pages 2307–2324, 2020

  72. [72]

    American fuzzy lop

    Michał Zalewski. American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2023. Accessed: 08/08/2023

  73. [73]

    Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing

    G Zhang, P Wang, T Yue, X Kong, S Huang, X Zhou, and K Lu. Mobfuzz: Adaptive multi-objective optimization in gray-box fuzzing. In Network and Distributed Systems Security (NDSS) Symposium, volume 2022, 2022

  74. [74]

    Fine-mixing: Mitigating backdoors in fine-tuned language models

    Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating backdoors in fine-tuned language models. arXiv preprint arXiv:2210.09545, 2022

  75. [75]

    Evolutionary mutation-based fuzzing as monte carlo tree search

    Yiru Zhao, Xiaoke Wang, Lei Zhao, Yueqiang Cheng, and Heng Yin. Evolutionary mutation-based fuzzing as monte carlo tree search. arXiv preprint arXiv:2101.00612, 2021

  76. [76]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  77. [77]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  78. [78]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023
