Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
Pith reviewed 2026-05-18 02:08 UTC · model grok-4.3
The pith
A multi-turn escalation method called Crescendo jailbreaks aligned LLMs by starting benign and progressively referencing the model's own prior answers to reach disallowed requests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Crescendo is a multi-turn jailbreak attack that interacts with the model in a seemingly benign manner, beginning with a general prompt and then gradually escalating the dialogue by referencing the model's replies to lead to a successful jailbreak of the alignment.
What carries the argument
Crescendo attack, a gradual multi-turn escalation that references prior model outputs to build toward disallowed requests without early refusal.
If this is right
- LLM alignment techniques must account for conversational context and escalation across multiple turns.
- Crescendomation provides a more effective automated way to test model safety than previous single-turn methods.
- Success on multiple models suggests widespread vulnerability to this style of attack.
- Multimodal models share the same weakness in handling escalating text-based queries.
Where Pith is reading between the lines
- This implies that safety training on isolated prompts leaves models open to attacks that unfold over time.
- Future defenses could monitor the trajectory of conversation topics rather than just individual messages.
- Attack success might be reduced if models are trained to recognize and break patterns of self-referential escalation.
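The trajectory-monitoring idea in the list above can be sketched in a few lines. This is a minimal illustration, not a method from the paper. It assumes that some external classifier already assigns each user message a risk score in [0, 1]. The class name `TrajectoryMonitor`, the helper `run_monitor`, and all thresholds are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryMonitor:
    """Flags conversations whose risk rises steadily across turns.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    per_turn_threshold: float = 0.8   # a single message this risky is blocked outright
    drift_threshold: float = 0.3      # total rise over the window that triggers review
    window: int = 3                   # number of recent turns inspected for a trend
    scores: List[float] = field(default_factory=list)

    def observe(self, score: float) -> str:
        """Record one per-message risk score and return a decision."""
        self.scores.append(score)
        if score >= self.per_turn_threshold:
            return "block"
        recent = self.scores[-self.window:]
        if len(recent) == self.window:
            # Strictly increasing scores with a large enough total rise
            # suggest gradual escalation even if no single turn is alarming.
            rising = all(b > a for a, b in zip(recent, recent[1:]))
            drift = recent[-1] - recent[0]
            if rising and drift >= self.drift_threshold:
                return "review"
        return "allow"


def run_monitor(scores: List[float]) -> List[str]:
    """Feed a whole conversation's scores through a fresh monitor."""
    monitor = TrajectoryMonitor()
    return [monitor.observe(s) for s in scores]
```

A per-message filter with the same 0.8 threshold would pass every turn of the sequence `[0.1, 0.3, 0.5]`, while this sketch returns "review" on the third turn because the scores rise steadily. A real deployment would need a calibrated scorer and would have to weigh false positives on conversations that legitimately deepen over time.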
Load-bearing premise
The target models will continue to respond to follow-up requests that reference their own prior outputs without triggering refusal mechanisms early in the escalation.
What would settle it
A model that consistently refuses to engage or changes the subject when a user references its previous answer to a general question and then asks for more specific harmful details.
Original abstract
Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Crescendo, a multi-turn jailbreak attack on aligned LLMs that begins with general benign prompts about a task and escalates the dialogue by progressively referencing the model's own prior responses to reach a successful jailbreak. It reports high attack success rates when evaluated on ChatGPT, Gemini Pro, Gemini Ultra, Llama-2 70b Chat, Llama-3 70b Chat, and Anthropic Chat. The authors also present Crescendomation, an automated implementation of the attack, which they show outperforms prior state-of-the-art jailbreaking techniques on an AdvBench subset (29-61% higher on GPT-4 and 49-71% higher on Gemini-Pro). The work additionally demonstrates the attack on multimodal models.
Significance. If the empirical results hold, the work is significant because it identifies a simple, conversational escalation pattern that exploits the absence of robust cross-turn refusal mechanisms in current alignment pipelines. The multi-model evaluation and direct comparison against existing techniques on a standard benchmark provide concrete evidence that multi-turn interactions remain an under-defended attack surface. The automation tool further increases the practical utility of the contribution for red-teaming and safety research.
major comments (2)
- [§4] §4 (Evaluation): The manuscript reports aggregate attack success rates for Crescendo across models and tasks but supplies no per-turn refusal statistics, no refusal-rate curves, and no breakdown of how often intermediate 'seemingly benign' queries still trigger safety refusals. Because the central claim rests on the model's willingness to sustain engagement across the escalation chain without early termination, the absence of these data leaves the reliability of the multi-turn process unverified.
- [§5] §5 (Crescendomation): The description of the automated tool does not specify the exact escalation heuristics, refusal-detection criteria, or prompt templates used to achieve the reported 29-61% and 49-71% gains over baselines on the AdvBench subset. Without these details the performance claims cannot be independently reproduced or stress-tested against the same refusal behavior that the manual Crescendo attack presupposes.
minor comments (1)
- [Abstract] The abstract states that Crescendo achieves 'high attack success rates' on the non-automated evaluations but does not provide the corresponding numerical values; adding them would make the abstract consistent with the quantitative claims later given for Crescendomation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. The comments identify opportunities to strengthen the empirical presentation and reproducibility of the Crescendo and Crescendomation results. We address each major comment below.
Point-by-point responses
- Referee: [§4] §4 (Evaluation): The manuscript reports aggregate attack success rates for Crescendo across models and tasks but supplies no per-turn refusal statistics, no refusal-rate curves, and no breakdown of how often intermediate 'seemingly benign' queries still trigger safety refusals. Because the central claim rests on the model's willingness to sustain engagement across the escalation chain without early termination, the absence of these data leaves the reliability of the multi-turn process unverified.
  Authors: We agree that per-turn refusal statistics and refusal-rate curves would provide stronger verification that models sustain engagement through the escalation steps. In the revised manuscript we will add these analyses to Section 4, including refusal rates for intermediate queries and curves that track successful escalation versus refusal across turns for representative models and tasks. The data will be derived from our existing evaluation logs.
  Revision: yes
- Referee: [§5] §5 (Crescendomation): The description of the automated tool does not specify the exact escalation heuristics, refusal-detection criteria, or prompt templates used to achieve the reported 29-61% and 49-71% gains over baselines on the AdvBench subset. Without these details the performance claims cannot be independently reproduced or stress-tested against the same refusal behavior that the manual Crescendo attack presupposes.
  Authors: We acknowledge that greater specificity is required for independent reproduction. In the revised Section 5 we will include the escalation heuristics, the refusal-detection criteria (keyword-based and model-assisted checks), and representative prompt templates used by Crescendomation. These additions will enable verification of the reported performance gains while keeping the tool description at a level appropriate for ongoing red-teaming use.
  Revision: yes
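The per-turn refusal statistics requested in the first comment could be computed from evaluation logs along the following lines. This is a sketch under stated assumptions, not the authors' analysis. It assumes each logged conversation is simply a list of model replies in turn order, and it uses a small keyword list as a stand-in for the keyword-based refusal check mentioned in the rebuttal. The marker strings and the function names `looks_like_refusal` and `per_turn_refusal_rate` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence

# Hypothetical refusal markers; the paper's actual criteria are not specified here.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def looks_like_refusal(reply: str) -> bool:
    """Crude keyword check: True if the reply contains any refusal marker."""
    text = reply.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def per_turn_refusal_rate(conversations: Sequence[Sequence[str]]) -> Dict[int, float]:
    """Return {turn number: fraction of conversations refused at that turn}.

    Turn numbers start at 1. Conversations may have different lengths;
    each turn's rate is computed over the conversations that reached it.
    """
    refused: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for replies in conversations:
        for turn, reply in enumerate(replies, start=1):
            total[turn] += 1
            refused[turn] += int(looks_like_refusal(reply))
    return {turn: refused[turn] / total[turn] for turn in sorted(total)}


# Toy logs, invented purely for illustration.
logs = [
    ["Sure, here is an overview.", "I'm sorry, but I can't help with that."],
    ["Here is some history.", "Here is more detail."],
    ["I cannot assist with this."],
]
rates = per_turn_refusal_rate(logs)
```

Plotting `rates` against the turn number gives the kind of refusal-rate curve the referee asks for. Keyword matching misses polite or partial refusals, so a model-assisted check would be a sensible second pass, as the rebuttal itself suggests.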
Circularity Check
No circularity: purely empirical attack evaluation
full rationale
The paper describes and empirically tests a multi-turn jailbreak technique called Crescendo on external LLMs (GPT-4, Gemini, Llama, etc.) with reported attack success rates. It contains no equations, fitted parameters, derivations, or self-citations that serve as load-bearing premises for the central claims. All results derive from direct interaction testing against independent models rather than any self-referential construction or renaming of inputs as outputs. The work is therefore self-contained with no reduction of claims to their own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
  A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
- ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
  ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
- Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
  Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection an...
- The Great Pretender: A Stochasticity Problem in LLM Jailbreak
  ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
- Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
  An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
  TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
  Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
  TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
- Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
  Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
- CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
  CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
- From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
  RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
- Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
  RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
- Generalization Limits of Reinforcement Learning Alignment
  Compound jailbreaks raise attack success on aligned LLMs from 14.3% to 71.4%, providing evidence that safety training generalizes less broadly than model capabilities.
- The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
  The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
- ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
  ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
  Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
- Robust AI Security and Alignment: A Sisyphean Endeavor?
  AI security and alignment cannot achieve full robustness because any sufficiently powerful AI inherits incompleteness-style limitations from formal systems.