arxiv: 2311.03191 · v5 · pith:5FX652FTnew · submitted 2023-11-06 · 💻 cs.LG · cs.CR

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Xuan Li , Zhanke Zhou , Jianing Zhu , Jiangchao Yao , Tongliang Liu , Bo Han This is my paper

Pith reviewed 2026-05-19 10:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords jailbreakinglarge language modelssafety alignmentadversarial attackspersonificationnested scenariosrole-playingharmful content generation

0 comments p. Extension

Add this Pith Number to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{5FX652FT}

Prints a linked pith:5FX652FT badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Nested role-play scenes let large language models generate harmful content and keep doing so in later exchanges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a lightweight technique that gets LLMs to imagine themselves inside a layered fictional scenario built from personification. This construction lets the model treat harmful requests as part of an inner story rather than direct instructions that violate its rules. The resulting outputs show higher rates of harmful content than earlier methods and the effect carries over into follow-up messages without new prompting. The same pattern appears across both open-source models such as Llama-2 and Llama-3 and closed-source ones such as the GPT series.

Core claim

By constructing a virtual nested scene through personification, the method enables LLMs to adaptively escape usage controls in normal scenarios, achieving leading harmfulness rates and sustaining jailbreaks in subsequent interactions on models including Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o.

What carries the argument

The virtual nested scene, a layered imaginary setting that the LLM role-plays within to bypass safety guardrails.

If this is right

The same self-losing weakness appears in both open-source models like Llama-2 and Llama-3 and closed-source models like GPT-3.5, GPT-4, and GPT-4o.
The induced contents reach higher harmfulness rates than previous jailbreak approaches.
Once the nested scene is established, harmful outputs continue in later interactions without repeated prompting.
The method requires only ordinary prompting and avoids high-cost computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Current safety filters may overlook layered fictional contexts because they focus on direct harmful phrasing.
Training models to identify and reject instructions that ask for nested role-play could reduce this type of bypass.
The approach may transfer to other AI systems that allow creative or persona-based interactions.

Load-bearing premise

Large language models have exploitable personification capabilities that let them build a virtual nested scene to escape safety controls without triggering detection.

What would settle it

Applying the method to a model whose safety training explicitly flags and refuses nested fictional role-play would produce no harmful outputs or lose the effect after the first exchange.

read the original abstract

Large language models (LLMs) have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on high-cost computational extrapolations, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method to take advantage of the LLMs' personification capabilities to construct $\textit{a virtual, nested scene}$, allowing it to realize an adaptive way to escape the usage control in a normal scenario. Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, $\textit{e.g.}$, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeepInception gives a straightforward nested-scene prompt that hits high harm rates on GPT-4 and Llama models, but the continuous-jailbreak claim likely rides on retained chat context rather than a deeper safety break.

read the letter

The main thing to know is that this paper describes a lightweight prompting method called DeepInception that builds nested virtual scenes to push LLMs into generating harmful content, and it reports strong results across both open and closed models with code released for checking. The continuous part of the claim, though, probably comes down to keeping the conversation history rather than creating a permanent escape from safety filters.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeepInception, a lightweight prompting technique inspired by the Milgram experiment's authority influence. By leveraging LLMs' personification capabilities to build a virtual nested scene, the method aims to bypass safety guardrails adaptively. The authors report that this approach achieves leading harmfulness rates compared to prior methods and enables continuous jailbreaks in follow-up interactions across both open-source (Llama-2, Llama-3) and closed-source (GPT-3.5, GPT-4, GPT-4o) models. Code and data are made available.

Significance. If validated, the work identifies a fundamental 'self-losing' vulnerability in LLM safety alignments and offers an efficient, non-computational alternative to existing jailbreak techniques. The public release of code enhances potential for reproducibility and further research into LLM robustness.

major comments (2)

[Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.
[Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.

minor comments (2)

[Abstract] Abstract: the phrase 'self-losing' is used without a formal definition; a precise characterization of this weakness would aid interpretation.
[Method] Method description: an explicit example of the nested-scene prompt or pseudocode would improve reproducibility beyond the GitHub link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and have revised the manuscript to improve clarity and completeness where the feedback identifies gaps.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full results section reports harmfulness rates across models (Llama-2/3, GPT-3.5/4, GPT-4o) with comparisons to prior methods such as GCG and AutoDAN, along with variability across runs. We have revised the abstract to explicitly state representative harmfulness rates (e.g., over 80% on several models) and direct baseline comparisons. We have also added a brief note on evaluation consistency in the results section to address error analysis. revision: yes
Referee: [Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.

Authors: We thank the referee for highlighting the need for experimental clarification. Our continuous-jailbreak experiments were performed by issuing follow-up queries within the same conversation thread after the initial DeepInception prompt, allowing the nested virtual scene to sustain the jailbroken state. To address potential confusion with simple context retention, we have added explicit description of the protocol in the revised experimental setup: all reported continuous results use the same thread, while separate ablation tests in fresh sessions confirm the method's effectiveness without relying on prior context. This distinction is now stated in both the abstract and methods. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical prompting method with no derivations or self-referential reductions

full rationale

The paper introduces DeepInception as a lightweight, Milgram-inspired prompting strategy to construct virtual nested scenes that exploit LLM personification for jailbreaking. No equations, parameters, or mathematical derivations appear in the provided text or abstract. The central claim rests on empirical harmfulness rates across Llama-2/3 and GPT models rather than any fitted input renamed as prediction or self-citation chain. The continuous-jailbreak observation is presented as an empirical outcome of retained context in interactions, without reduction to a self-defined quantity. This is a standard empirical contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that LLMs can be reliably induced into nested role-play without guardrail activation, plus the empirical observation that this produces continuous harmful output.

axioms (1)

domain assumption LLMs possess personification capabilities that can be leveraged to construct virtual nested scenes bypassing safety controls.
Invoked when describing the Milgram-inspired method and adaptive escape from usage control.

invented entities (1)

Virtual nested scene no independent evidence
purpose: To enable adaptive jailbreak by framing harmful content inside fictional or authorized inner scenarios.
New construct introduced in the method; no independent falsifiable evidence outside the prompting experiments is provided.

pith-pipeline@v0.9.0 · 5726 in / 1243 out tokens · 56699 ms · 2026-05-19T10:32:17.692638+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
cs.LG 2026-04 unverdicted novelty 7.0

Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
cs.CR 2026-04 unverdicted novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
cs.CR 2025-10 unverdicted novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD 2026-04 unverdicted novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
cs.CR 2026-04 unverdicted novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
cs.SE 2025-12 unverdicted novelty 6.0

RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
cs.CL 2025-10 unverdicted novelty 6.0

Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
cs.CL 2025-08 unverdicted novelty 6.0

ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Exploring the Secondary Risks of Large Language Models
cs.LG 2025-06 unverdicted novelty 6.0

Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
cs.CL 2023-09 conditional novelty 6.0

Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
SoK: Robustness in Large Language Models against Jailbreak Attacks
cs.CR 2026-05 accept novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
cs.CR 2025-11 unverdicted novelty 5.0

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
cs.CR 2025-10 unverdicted novelty 5.0

SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards
cs.CR 2026-04 unverdicted novelty 4.0

Jailbreaking LLMs for smart grid operations yields 33.1% overall attack success rate, with DeepInception at 63.17%, Claude 3.5 Haiku at 0%, Gemini 2.0 Flash-Lite at 55.04%, and GPT-4o mini at 44.34%.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
cs.CR 2024-07 accept novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 18 Pith papers

[1]

Using large language models to simulate multiple humans and replicate human subject studies

Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In ICML, 2023

work page 2023
[2]

Exploring the psychology of gpt-4’s moral and legal reasoning

Guilherme FCF Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of gpt-4’s moral and legal reasoning. In arXiv, 2023

work page 2023
[3]

Jailbreaking leading safety-aligned llms with simple adaptive attacks

Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In arXiv, 2024

work page 2024
[4]

Many-shot jailbreaking

Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking. In Anthropic, April, 2024

work page 2024
[5]

(ab) using images and sounds for indirect instruction injection in multi-modal llms

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. (ab) using images and sounds for indirect instruction injection in multi-modal llms. In arXiv, 2023

work page 2023
[6]

Constitutional ai: Harmlessness from ai feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. In arXiv, 2022

work page 2022
[7]

Image hijacks: Adversarial images can control generative models at runtime

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime. In arXiv, 2023

work page 2023
[8]

On the opportunities and risks of foundation models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. In arXiv, 2021

work page 2021
[9]

Take a look at it! rethinking how to evaluate language model jailbreak

Hongyu Cai, Arjun Arunasalam, Leo Y Lin, Antonio Bianchi, and Z Berkay Celik. Take a look at it! rethinking how to evaluate language model jailbreak. In ACL, 2024

work page 2024
[10]

Are aligned neural networks adversarially aligned? In arXiv, 2023

Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? In arXiv, 2023

work page 2023
[11]

Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt

Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? In NeurIPS, 2023

work page 2023
[12]

Jailbreaking black box large language models in twenty queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In arXiv, 2023

work page 2023
[13]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In arXiv, 2024. 11

work page 2024
[14]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. 2023

work page 2023
[15]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2022
[16]

Breaking down the defenses: A comparative survey of attacks on large language models

Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vinija Jain, and Aman Chadha. Breaking down the defenses: A comparative survey of attacks on large language models. In arXiv, 2024

work page 2024
[17]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

work page 2017
[18]

Safe rlhf: Safe reinforcement learning from human feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In arXiv, 2023

work page 2023
[19]

Security and privacy challenges of large language models: A survey

Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey. In arXiv, 2024

work page 2024
[20]

Jailbreaker: Automated jailbreak across multiple large language model chatbots

Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. In arXiv, 2023

work page 2023
[21]

Can ai language models replace human participants? Trends in Cognitive Sciences, 2023

Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. Can ai language models replace human participants? Trends in Cognitive Sciences, 2023

work page 2023
[22]

A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In arXiv, 2023

work page 2023
[23]

Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024

Michael Feffer, Anusha Sinha, Zachary C Lipton, and Hoda Heidari. Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024

work page 2024
[24]

Gpt-3: Its nature, scope, limits, and consequences

Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 2020

work page 2020
[25]

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. In arXiv, 2022

work page 2022
[26]

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. In ICML, 2024

work page 2024
[27]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In arXiv, 2022

work page 2022
[28]

Glass, Akash Srivastava, and Pulkit Agrawal

Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models. In ICLR, 2024. 12

work page 2024
[29]

Catastrophic jailbreak of open-source llms via exploiting generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. In ICLR, 2023

work page 2023
[30]

Llama guard: Llm-based input-output safeguard for human-ai conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. In arXiv, 2023

work page 2023
[31]

Baseline defenses for adversarial attacks against aligned language models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. In arXiv, 2023

work page 2023
[32]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page 2024
[33]

Artprompt: Ascii art-based jailbreak attacks against aligned llms

Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In ACL, 2024

work page 2024
[34]

Scaling laws for neural language models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. In arXiv, 2020

work page 2020
[35]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023

work page 2023
[36]

Adversarial tuning: Defending against jailbreak attacks for llms

Fan Liu, Zhao Xu, and Hao Liu. Adversarial tuning: Defending against jailbreak attacks for llms. In arXiv, 2024

work page 2024
[37]

Autodan: Generating stealthy jailbreak prompts on aligned large language models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. In arXiv, 2023

work page 2023
[38]

Query-relevant images jailbreak large multi-modal models

Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Query-relevant images jailbreak large multi-modal models. In arXiv, 2023

work page 2023
[39]

Jailbreaking chatgpt via prompt engineering: An empirical study

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. In arXiv, 2023

work page 2023
[40]

Protecting your llms with information bottleneck

Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, and Jiang Bian. Protecting your llms with information bottleneck. In arXiv, 2024

work page 2024
[41]

Analyzing leakage of personally identifiable information in language models

Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella- Béguelin. Analyzing leakage of personally identifiable information in language models. In arXiv, 2023

work page 2023
[42]

Behavioral study of obedience

Stanley Milgram. Behavioral study of obedience. The Journal of abnormal and social psychology, 1963

work page 1963
[43]

Obedience to authority: An experimental view

Stanley Milgram. Obedience to authority: An experimental view. 1974

work page 1974
[44]

Codegen: An open large language model for code with multi-turn program synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In arXiv, 2022

work page 2022
[45]

Our approach to ai safety., 2023

OpenAI. Our approach to ai safety., 2023

work page 2023
[46]

Gpt-4 technical report

R OpenAI. Gpt-4 technical report. In arXiv, 2023

work page 2023
[47]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 13

work page 2022
[48]

Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal

Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal. Teach LLMs to phish: Stealing private information from language models. In ICLR, 2024

work page 2024
[49]

Can sensitive information be deleted from llms? objectives for defending against extraction attacks

Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. In arXiv, 2023

work page 2023
[50]

The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In arXiv, 2023

work page 2023
[51]

Visual adversarial examples jailbreak large language models

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak large language models. In arXiv, 2023

work page 2023
[52]

Fine-tuning aligned language models compromises safety, even when users do not intend to! In arXiv, 2023

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In arXiv, 2023

work page 2023
[53]

Smoothllm: Defending large language models against jailbreaking attacks

Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. In arXiv, 2023

work page 2023
[54]

Evaluating the moral beliefs encoded in llms

Nino Scherrer, Claudia Shi, Amir Feder, and David M Blei. Evaluating the moral beliefs encoded in llms. In arXiv, 2023

work page 2023
[55]

On the adversarial robustness of multi-modal founda- tion models

Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal founda- tion models. In ICCV, 2023

work page 2023
[56]

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In arXiv, 2023

work page 2023
[57]

Survey of vulnerabilities in large language models revealed by adversarial attacks

Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu- Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks. In arXiv, 2023

work page 2023
[58]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In arXiv, 2023

work page 2023
[59]

Megatron-lm: Training multi-billion parameter language models using model parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. In arXiv, 2019

work page 2019
[60]

Llama 2: Open foundation and fine-tuned chat models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. In arXiv, 2023

work page 2023
[61]

Tensor trust: Interpretable prompt injection attacks from an online game

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor trust: Interpretable prompt injection attacks from an online game. In arXiv, 2023

work page 2023
[62]

On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs

Jen tse Huang, Wenxuan Wang, Eric John Li, Man Ho LAM, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs. In ICLR, 2024

work page 2024
[63]

Operationalizing a threat model for red-teaming large language models (llms)

Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, and NhatHai Phan. Operationalizing a threat model for red-teaming large language models (llms). In arXiv, 2024

work page 2024
[64]

Exploring the limits of domain-adaptive training for detoxifying large-scale language models

Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. In NeurIPS, 2022

work page 2022
[65]

Jailbroken: How does llm safety training fail? In NeurIPS, 2023

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? In NeurIPS, 2023. 14

work page 2023
[66]

Dai, and Quoc V Le

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022

work page 2022
[67]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. In arXiv, 2022

work page 2022
[68]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

work page 2022
[69]

Jailbreak and guard aligned language models with only few in-context demonstrations

Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. In arXiv, 2023

work page 2023
[70]

Defending chatgpt against jailbreak attack via self-reminders

Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023

work page 2023
[71]

An llm can fool itself: A prompt-based adversarial attack

Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. An llm can fool itself: A prompt-based adversarial attack. In arXiv, 2023

work page 2023
[72]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In arXiv, 2023

work page 2023
[73]

Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. In arXiv, 2023

work page 2023
[74]

How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In arXiv, 2024

work page 2024
[75]

Autodefense: Multi- agent llm defense against jailbreak attacks

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi- agent llm defense against jailbreak attacks. In arXiv, 2024

work page 2024
[76]

Make them spill the beans! coercive knowledge extraction from (production) llms

Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. Make them spill the beans! coercive knowledge extraction from (production) llms. In arXiv, 2023

work page 2023
[77]

Parden, can you repeat that? defending against jailbreaks via repetition

Ziyang Zhang, Qizhen Zhang, and Jakob Foerster. Parden, can you repeat that? defending against jailbreaks via repetition. In arXiv, 2024

work page 2024
[78]

A survey of large language models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. In arXiv, 2023

work page 2023
[79]

Weak-to-strong jailbreaking on large language models

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. In arXiv, 2024

work page 2024
[80]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In arXiv, 2023

work page 2023

Showing first 80 references.