arxiv: 2404.01833 · v3 · pith:TLBEXMBVnew · submitted 2024-04-02 · 💻 cs.CR · cs.AI

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Mark Russinovich, Ahmed Salem, Ronen Eldan

Pith reviewed 2026-05-18 02:08 UTC · model grok-4.3

keywords: jailbreak attack, LLM safety, multi-turn attack, Crescendo, Crescendomation, adversarial prompting, AI alignment, large language models

The pith

A multi-turn escalation method called Crescendo jailbreaks aligned LLMs by starting benign and progressively referencing the model's own prior answers to reach disallowed requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Crescendo, a jailbreak attack that opens with innocuous questions about a topic and then escalates the request over multiple turns by referencing what the model has already said. Because each step reads as a natural follow-up, the conversation avoids triggering refusals early. Tested on public systems including ChatGPT, Gemini Pro, Gemini-Ultra, Llama-2 70b and Llama-3 70b Chat, and Anthropic Chat, it achieves high success rates in eliciting harmful or prohibited content. An automated implementation, Crescendomation, outperforms other jailbreak methods on an AdvBench subset, by 29-61% on GPT-4 and 49-71% on Gemini-Pro. The same approach also works on multimodal models.

Core claim

Crescendo is a multi-turn jailbreak attack that interacts with the model in a seemingly benign manner: it begins with a general prompt and then gradually escalates the dialogue by referencing the model's own replies until the model's alignment is bypassed.

What carries the argument

Crescendo attack, a gradual multi-turn escalation that references prior model outputs to build toward disallowed requests without early refusal.

If this is right

LLM alignment techniques must account for conversational context and escalation across multiple turns.
Crescendomation provides a more effective automated way to test model safety than previous single-turn methods.
Success on multiple models suggests widespread vulnerability to this style of attack.
Multimodal models share the same weakness in handling escalating text-based queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that safety training on isolated prompts leaves models open to attacks that unfold over several turns.
Future defenses could monitor the trajectory of conversation topics rather than just individual messages.
Attack success might be reduced if models are trained to recognize and break patterns of self-referential escalation.
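
The trajectory-monitoring idea above can be sketched from the defender's side. The following is a minimal illustration, not a method from the paper: it assumes some external classifier already assigns each user turn a risk score in [0, 1], and the function name `check_trajectory`, the thresholds, and the window size are all hypothetical choices made for this example. It shows how a conversation can be flagged for gradual drift even when no single message crosses a per-message limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrajectoryVerdict:
    flagged: bool
    reason: str


def check_trajectory(
    risk_scores: List[float],
    per_message_limit: float = 0.8,  # hypothetical threshold
    drift_limit: float = 0.4,        # hypothetical threshold
    window: int = 4,                 # hypothetical window size
) -> TrajectoryVerdict:
    """Flag a conversation from per-turn risk scores in [0, 1].

    risk_scores[i] is an assumed score for user turn i, produced by an
    external content classifier that this sketch does not implement.
    """
    if not risk_scores:
        return TrajectoryVerdict(False, "empty conversation")

    # Per-message check: any single turn at or above the limit.
    if max(risk_scores) >= per_message_limit:
        return TrajectoryVerdict(True, "single message over limit")

    # Trajectory check: rise from the lowest score in the recent
    # window to the score of the latest turn.
    recent = risk_scores[-window:]
    drift = recent[-1] - min(recent)
    if drift >= drift_limit:
        return TrajectoryVerdict(True, "gradual escalation")

    return TrajectoryVerdict(False, "ok")
```

With these assumed thresholds, a score sequence such as `[0.05, 0.2, 0.35, 0.5, 0.65]` is flagged for gradual escalation although every individual score stays below the per-message limit, while a flat sequence such as `[0.3, 0.3, 0.35, 0.3]` passes.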

Load-bearing premise

The target models will continue to respond to follow-up requests that reference their own prior outputs without triggering refusal mechanisms early in the escalation.

What would settle it

A model that consistently refuses to engage or changes the subject when a user references its previous answer to a general question and then asks for more specific harmful details.

read the original abstract

Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Crescendo, a multi-turn jailbreak attack on aligned LLMs that begins with general benign prompts about a task and escalates the dialogue by progressively referencing the model's own prior responses to reach a successful jailbreak. It reports high attack success rates when evaluated on ChatGPT, Gemini Pro, Gemini Ultra, Llama-2 70b Chat, Llama-3 70b Chat, and Anthropic Chat. The authors also present Crescendomation, an automated implementation of the attack, which they show outperforms prior state-of-the-art jailbreaking techniques on an AdvBench subset (29-61% higher on GPT-4 and 49-71% higher on Gemini-Pro). The work additionally demonstrates the attack on multimodal models.

Significance. If the empirical results hold, the work is significant because it identifies a simple, conversational escalation pattern that exploits the absence of robust cross-turn refusal mechanisms in current alignment pipelines. The multi-model evaluation and direct comparison against existing techniques on a standard benchmark provide concrete evidence that multi-turn interactions remain an under-defended attack surface. The automation tool further increases the practical utility of the contribution for red-teaming and safety research.

major comments (2)

[§4] §4 (Evaluation): The manuscript reports aggregate attack success rates for Crescendo across models and tasks but supplies no per-turn refusal statistics, no refusal-rate curves, and no breakdown of how often intermediate 'seemingly benign' queries still trigger safety refusals. Because the central claim rests on the model's willingness to sustain engagement across the escalation chain without early termination, the absence of these data leaves the reliability of the multi-turn process unverified.
[§5] §5 (Crescendomation): The description of the automated tool does not specify the exact escalation heuristics, refusal-detection criteria, or prompt templates used to achieve the reported 29-61% and 49-71% gains over baselines on the AdvBench subset. Without these details the performance claims cannot be independently reproduced or stress-tested against the same refusal behavior that the manual Crescendo attack presupposes.
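
The per-turn refusal statistics requested in the first comment could be computed from evaluation logs roughly as follows. This is a sketch under stated assumptions: the record layout `(conversation_id, turn_index, refused)`, the function names, and the sample data are all invented for illustration and do not come from the paper or its logs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed log record: (conversation_id, turn_index, refused)
Record = Tuple[str, int, bool]


def per_turn_refusal_rate(records: Iterable[Record]) -> Dict[int, float]:
    """Return the fraction of refused responses at each turn index."""
    totals: Dict[int, int] = defaultdict(int)
    refusals: Dict[int, int] = defaultdict(int)
    for _conv_id, turn, refused in records:
        totals[turn] += 1
        if refused:
            refusals[turn] += 1
    return {turn: refusals[turn] / totals[turn] for turn in sorted(totals)}


def first_refusal_turn(records: Iterable[Record]) -> Dict[str, int]:
    """Map each conversation with at least one refusal to its earliest refused turn."""
    first: Dict[str, int] = {}
    for conv_id, turn, refused in records:
        if refused and (conv_id not in first or turn < first[conv_id]):
            first[conv_id] = turn
    return first


# Made-up sample data, for illustration only.
EXAMPLE_LOGS = [
    ("a", 1, False), ("a", 2, False), ("a", 3, True),
    ("b", 1, False), ("b", 2, True),
    ("c", 1, False), ("c", 2, False), ("c", 3, False),
]

rates = per_turn_refusal_rate(EXAMPLE_LOGS)
earliest = first_refusal_turn(EXAMPLE_LOGS)
```

Plotting `rates` against turn index gives the refusal-rate curve the referee asks for, and `earliest` shows where in the chain each refused conversation first broke down.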

minor comments (1)

[Abstract] The abstract states that Crescendo achieves 'high attack success rates' on the non-automated evaluations but does not provide the corresponding numerical values; adding them would make the abstract consistent with the quantitative claims later given for Crescendomation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments identify opportunities to strengthen the empirical presentation and reproducibility of the Crescendo and Crescendomation results. We address each major comment below.

Point-by-point responses

Referee: [§4] §4 (Evaluation): The manuscript reports aggregate attack success rates for Crescendo across models and tasks but supplies no per-turn refusal statistics, no refusal-rate curves, and no breakdown of how often intermediate 'seemingly benign' queries still trigger safety refusals. Because the central claim rests on the model's willingness to sustain engagement across the escalation chain without early termination, the absence of these data leaves the reliability of the multi-turn process unverified.

Authors: We agree that per-turn refusal statistics and refusal-rate curves would provide stronger verification that models sustain engagement through the escalation steps. In the revised manuscript we will add these analyses to Section 4, including refusal rates for intermediate queries and curves that track successful escalation versus refusal across turns for representative models and tasks. The data will be derived from our existing evaluation logs. revision: yes
Referee: [§5] §5 (Crescendomation): The description of the automated tool does not specify the exact escalation heuristics, refusal-detection criteria, or prompt templates used to achieve the reported 29-61% and 49-71% gains over baselines on the AdvBench subset. Without these details the performance claims cannot be independently reproduced or stress-tested against the same refusal behavior that the manual Crescendo attack presupposes.

Authors: We acknowledge that greater specificity is required for independent reproduction. In the revised Section 5 we will include the escalation heuristics, the refusal-detection criteria (keyword-based and model-assisted checks), and representative prompt templates used by Crescendomation. These additions will enable verification of the reported performance gains while keeping the tool description at a level appropriate for ongoing red-teaming use. revision: yes
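
The simulated response mentions keyword-based refusal detection without giving details. A minimal sketch of what such a check might look like for evaluation purposes is shown below. The phrase list, the function name `looks_like_refusal`, and the `head_chars` cutoff are illustrative assumptions and are not the criteria used by Crescendomation; the model-assisted check is not sketched here.

```python
import re
from typing import Sequence

# Illustrative phrases only; NOT the list used by Crescendomation.
REFUSAL_PATTERNS: Sequence[str] = (
    r"\bI can['’]?t (?:help|assist|comply)\b",
    r"\bI cannot (?:help|assist|comply)\b",
    r"\bI['’]m (?:sorry|unable)\b",
    r"\bI am (?:sorry|unable)\b",
    r"\bI won['’]t\b",
)

_COMPILED = [re.compile(p, re.IGNORECASE) for p in REFUSAL_PATTERNS]


def looks_like_refusal(response: str, head_chars: int = 200) -> bool:
    """Return True if the start of a response matches a refusal phrase.

    Only the first head_chars characters are inspected, on the assumption
    that refusals are usually stated at the beginning of a reply.
    """
    head = response[:head_chars]
    return any(p.search(head) for p in _COMPILED)
```

A keyword check of this kind is cheap but brittle: it misses refusals phrased in unlisted ways and can misfire on replies that merely quote such phrases, which is presumably why the response pairs it with a model-assisted check.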

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation

Full rationale

The paper describes and empirically tests a multi-turn jailbreak technique called Crescendo on external LLMs (GPT-4, Gemini, Llama, etc.) with reported attack success rates. It contains no equations, fitted parameters, derivations, or self-citations that serve as load-bearing premises for the central claims. All results derive from direct interaction testing against independent models rather than any self-referential construction or renaming of inputs as outputs. The work is therefore self-contained with no reduction of claims to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution rests on an empirical attack strategy and direct testing; no free parameters, mathematical axioms, or newly postulated entities are introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking unclear

Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
cs.CR 2026-05 unverdicted novelty 7.0

A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
cs.CL 2026-05 unverdicted novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
cs.CR 2026-04 conditional novelty 7.0

Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection an...
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
cs.CR 2026-05 conditional novelty 6.0

ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
cs.AI 2026-05 unverdicted novelty 6.0

An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
cs.CR 2026-04 unverdicted novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
cs.CR 2026-04 unverdicted novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
cs.CR 2026-04 unverdicted novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
cs.CR 2026-04 conditional novelty 6.0

Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
cs.SE 2025-12 unverdicted novelty 6.0

RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
cs.CR 2025-12 unverdicted novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
Generalization Limits of Reinforcement Learning Alignment
cs.LG 2026-04 unverdicted novelty 5.0

Compound jailbreaks raise attack success on aligned LLMs from 14.3% to 71.4%, providing evidence that safety training generalizes less broadly than model capabilities.
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
cs.LG 2026-03 unverdicted novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
cs.CR 2025-11 unverdicted novelty 5.0

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
cs.AI 2024-11 unverdicted novelty 5.0

Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
Robust AI Security and Alignment: A Sisyphean Endeavor?
cs.AI 2025-12 unverdicted novelty 3.0

AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.
