DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Pith reviewed 2026-05-19 10:32 UTC · model grok-4.3
pith:5FX652FT Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{5FX652FT}
Prints a linked pith:5FX652FT badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Nested role-play scenes let large language models generate harmful content and keep doing so in later exchanges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a virtual nested scene through personification, the method enables LLMs to adaptively escape usage controls in normal scenarios, achieving leading harmfulness rates and sustaining jailbreaks in subsequent interactions on models including Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o.
What carries the argument
The virtual nested scene, a layered imaginary setting that the LLM role-plays within to bypass safety guardrails.
If this is right
- The same self-losing weakness appears in both open-source models like Llama-2 and Llama-3 and closed-source models like GPT-3.5, GPT-4, and GPT-4o.
- The induced contents reach higher harmfulness rates than previous jailbreak approaches.
- Once the nested scene is established, harmful outputs continue in later interactions without repeated prompting.
- The method requires only ordinary prompting and avoids high-cost computation.
Where Pith is reading between the lines
- Current safety filters may overlook layered fictional contexts because they focus on direct harmful phrasing.
- Training models to identify and reject instructions that ask for nested role-play could reduce this type of bypass.
- The approach may transfer to other AI systems that allow creative or persona-based interactions.
Load-bearing premise
Large language models have exploitable personification capabilities that let them build a virtual nested scene to escape safety controls without triggering detection.
What would settle it
Applying the method to a model whose safety training explicitly flags and refuses nested fictional role-play would produce no harmful outputs or lose the effect after the first exchange.
read the original abstract
Large language models (LLMs) have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on high-cost computational extrapolations, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method to take advantage of the LLMs' personification capabilities to construct $\textit{a virtual, nested scene}$, allowing it to realize an adaptive way to escape the usage control in a normal scenario. Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, $\textit{e.g.}$, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DeepInception, a lightweight prompting technique inspired by the Milgram experiment's authority influence. By leveraging LLMs' personification capabilities to build a virtual nested scene, the method aims to bypass safety guardrails adaptively. The authors report that this approach achieves leading harmfulness rates compared to prior methods and enables continuous jailbreaks in follow-up interactions across both open-source (Llama-2, Llama-3) and closed-source (GPT-3.5, GPT-4, GPT-4o) models. Code and data are made available.
Significance. If validated, the work identifies a fundamental 'self-losing' vulnerability in LLM safety alignments and offers an efficient, non-computational alternative to existing jailbreak techniques. The public release of code enhances potential for reproducibility and further research into LLM robustness.
major comments (2)
- [Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.
- [Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.
minor comments (2)
- [Abstract] Abstract: the phrase 'self-losing' is used without a formal definition; a precise characterization of this weakness would aid interpretation.
- [Method] Method description: an explicit example of the nested-scene prompt or pseudocode would improve reproducibility beyond the GitHub link.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and have revised the manuscript to improve clarity and completeness where the feedback identifies gaps.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.
Authors: We agree that the abstract would be strengthened by including key quantitative results. The full results section reports harmfulness rates across models (Llama-2/3, GPT-3.5/4, GPT-4o) with comparisons to prior methods such as GCG and AutoDAN, along with variability across runs. We have revised the abstract to explicitly state representative harmfulness rates (e.g., over 80% on several models) and direct baseline comparisons. We have also added a brief note on evaluation consistency in the results section to address error analysis. revision: yes
-
Referee: [Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.
Authors: We thank the referee for highlighting the need for experimental clarification. Our continuous-jailbreak experiments were performed by issuing follow-up queries within the same conversation thread after the initial DeepInception prompt, allowing the nested virtual scene to sustain the jailbroken state. To address potential confusion with simple context retention, we have added explicit description of the protocol in the revised experimental setup: all reported continuous results use the same thread, while separate ablation tests in fresh sessions confirm the method's effectiveness without relying on prior context. This distinction is now stated in both the abstract and methods. revision: yes
Circularity Check
No circularity: empirical prompting method with no derivations or self-referential reductions
full rationale
The paper introduces DeepInception as a lightweight, Milgram-inspired prompting strategy to construct virtual nested scenes that exploit LLM personification for jailbreaking. No equations, parameters, or mathematical derivations appear in the provided text or abstract. The central claim rests on empirical harmfulness rates across Llama-2/3 and GPT models rather than any fitted input renamed as prediction or self-citation chain. The continuous-jailbreak observation is presented as an empirical outcome of retained context in interactions, without reduction to a self-defined quantity. This is a standard empirical contribution with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs possess personification capabilities that can be leveraged to construct virtual nested scenes bypassing safety controls.
invented entities (1)
-
Virtual nested scene
no independent evidence
Forward citations
Cited by 18 Pith papers
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
-
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
-
From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
-
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.
-
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
-
Exploring the Secondary Risks of Large Language Models
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
-
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
-
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards
Jailbreaking LLMs for smart grid operations yields 33.1% overall attack success rate, with DeepInception at 63.17%, Claude 3.5 Haiku at 0%, Gemini 2.0 Flash-Lite at 55.04%, and GPT-4o mini at 44.34%.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Reference graph
Works this paper leans on
-
[1]
Using large language models to simulate multiple humans and replicate human subject studies
Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In ICML, 2023
work page 2023
-
[2]
Exploring the psychology of gpt-4’s moral and legal reasoning
Guilherme FCF Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of gpt-4’s moral and legal reasoning. In arXiv, 2023
work page 2023
-
[3]
Jailbreaking leading safety-aligned llms with simple adaptive attacks
Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In arXiv, 2024
work page 2024
-
[4]
Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking. In Anthropic, April, 2024
work page 2024
-
[5]
(ab) using images and sounds for indirect instruction injection in multi-modal llms
Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. (ab) using images and sounds for indirect instruction injection in multi-modal llms. In arXiv, 2023
work page 2023
-
[6]
Constitutional ai: Harmlessness from ai feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. In arXiv, 2022
work page 2022
-
[7]
Image hijacks: Adversarial images can control generative models at runtime
Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime. In arXiv, 2023
work page 2023
-
[8]
On the opportunities and risks of foundation models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. In arXiv, 2021
work page 2021
-
[9]
Take a look at it! rethinking how to evaluate language model jailbreak
Hongyu Cai, Arjun Arunasalam, Leo Y Lin, Antonio Bianchi, and Z Berkay Celik. Take a look at it! rethinking how to evaluate language model jailbreak. In ACL, 2024
work page 2024
-
[10]
Are aligned neural networks adversarially aligned? In arXiv, 2023
Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? In arXiv, 2023
work page 2023
-
[11]
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? In NeurIPS, 2023
work page 2023
-
[12]
Jailbreaking black box large language models in twenty queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In arXiv, 2023
work page 2023
-
[13]
Jailbreakbench: An open robustness benchmark for jailbreaking large language models
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In arXiv, 2024. 11
work page 2024
-
[14]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. 2023
work page 2023
-
[15]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
work page 2022
-
[16]
Breaking down the defenses: A comparative survey of attacks on large language models
Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vinija Jain, and Aman Chadha. Breaking down the defenses: A comparative survey of attacks on large language models. In arXiv, 2024
work page 2024
-
[17]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017
work page 2017
-
[18]
Safe rlhf: Safe reinforcement learning from human feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In arXiv, 2023
work page 2023
-
[19]
Security and privacy challenges of large language models: A survey
Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey. In arXiv, 2024
work page 2024
-
[20]
Jailbreaker: Automated jailbreak across multiple large language model chatbots
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. In arXiv, 2023
work page 2023
-
[21]
Can ai language models replace human participants? Trends in Cognitive Sciences, 2023
Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. Can ai language models replace human participants? Trends in Cognitive Sciences, 2023
work page 2023
-
[22]
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In arXiv, 2023
work page 2023
-
[23]
Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024
Michael Feffer, Anusha Sinha, Zachary C Lipton, and Hoda Heidari. Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024
work page 2024
-
[24]
Gpt-3: Its nature, scope, limits, and consequences
Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 2020
work page 2020
-
[25]
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. In arXiv, 2022
work page 2022
-
[26]
Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. In ICML, 2024
work page 2024
-
[27]
Training compute-optimal large language models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In arXiv, 2022
work page 2022
-
[28]
Glass, Akash Srivastava, and Pulkit Agrawal
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models. In ICLR, 2024. 12
work page 2024
-
[29]
Catastrophic jailbreak of open-source llms via exploiting generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. In ICLR, 2023
work page 2023
-
[30]
Llama guard: Llm-based input-output safeguard for human-ai conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. In arXiv, 2023
work page 2023
-
[31]
Baseline defenses for adversarial attacks against aligned language models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. In arXiv, 2023
work page 2023
-
[32]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
work page 2024
-
[33]
Artprompt: Ascii art-based jailbreak attacks against aligned llms
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In ACL, 2024
work page 2024
-
[34]
Scaling laws for neural language models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. In arXiv, 2020
work page 2020
-
[35]
Inference- time intervention: Eliciting truthful answers from a language model
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023
work page 2023
-
[36]
Adversarial tuning: Defending against jailbreak attacks for llms
Fan Liu, Zhao Xu, and Hao Liu. Adversarial tuning: Defending against jailbreak attacks for llms. In arXiv, 2024
work page 2024
-
[37]
Autodan: Generating stealthy jailbreak prompts on aligned large language models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. In arXiv, 2023
work page 2023
-
[38]
Query-relevant images jailbreak large multi-modal models
Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Query-relevant images jailbreak large multi-modal models. In arXiv, 2023
work page 2023
-
[39]
Jailbreaking chatgpt via prompt engineering: An empirical study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. In arXiv, 2023
work page 2023
-
[40]
Protecting your llms with information bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, and Jiang Bian. Protecting your llms with information bottleneck. In arXiv, 2024
work page 2024
-
[41]
Analyzing leakage of personally identifiable information in language models
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella- Béguelin. Analyzing leakage of personally identifiable information in language models. In arXiv, 2023
work page 2023
-
[42]
Stanley Milgram. Behavioral study of obedience. The Journal of abnormal and social psychology, 1963
work page 1963
-
[43]
Obedience to authority: An experimental view
Stanley Milgram. Obedience to authority: An experimental view. 1974
work page 1974
-
[44]
Codegen: An open large language model for code with multi-turn program synthesis
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In arXiv, 2022
work page 2022
- [45]
- [46]
-
[47]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 13
work page 2022
-
[48]
Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal
Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal. Teach LLMs to phish: Stealing private information from language models. In ICLR, 2024
work page 2024
-
[49]
Can sensitive information be deleted from llms? objectives for defending against extraction attacks
Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. In arXiv, 2023
work page 2023
-
[50]
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In arXiv, 2023
work page 2023
-
[51]
Visual adversarial examples jailbreak large language models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak large language models. In arXiv, 2023
work page 2023
-
[52]
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In arXiv, 2023
work page 2023
-
[53]
Smoothllm: Defending large language models against jailbreaking attacks
Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. In arXiv, 2023
work page 2023
-
[54]
Evaluating the moral beliefs encoded in llms
Nino Scherrer, Claudia Shi, Amir Feder, and David M Blei. Evaluating the moral beliefs encoded in llms. In arXiv, 2023
work page 2023
-
[55]
On the adversarial robustness of multi-modal founda- tion models
Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal founda- tion models. In ICCV, 2023
work page 2023
-
[56]
Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models
Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In arXiv, 2023
work page 2023
-
[57]
Survey of vulnerabilities in large language models revealed by adversarial attacks
Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu- Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks. In arXiv, 2023
work page 2023
-
[58]
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In arXiv, 2023
work page 2023
-
[59]
Megatron-lm: Training multi-billion parameter language models using model parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. In arXiv, 2019
work page 2019
-
[60]
Llama 2: Open foundation and fine-tuned chat models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. In arXiv, 2023
work page 2023
-
[61]
Tensor trust: Interpretable prompt injection attacks from an online game
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor trust: Interpretable prompt injection attacks from an online game. In arXiv, 2023
work page 2023
-
[62]
On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs
Jen tse Huang, Wenxuan Wang, Eric John Li, Man Ho LAM, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs. In ICLR, 2024
work page 2024
-
[63]
Operationalizing a threat model for red-teaming large language models (llms)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, and NhatHai Phan. Operationalizing a threat model for red-teaming large language models (llms). In arXiv, 2024
work page 2024
-
[64]
Exploring the limits of domain-adaptive training for detoxifying large-scale language models
Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. In NeurIPS, 2022
work page 2022
-
[65]
Jailbroken: How does llm safety training fail? In NeurIPS, 2023
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? In NeurIPS, 2023. 14
work page 2023
-
[66]
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022
work page 2022
-
[67]
Emergent abilities of large language models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. In arXiv, 2022
work page 2022
-
[68]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022
work page 2022
-
[69]
Jailbreak and guard aligned language models with only few in-context demonstrations
Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. In arXiv, 2023
work page 2023
-
[70]
Defending chatgpt against jailbreak attack via self-reminders
Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023
work page 2023
-
[71]
An llm can fool itself: A prompt-based adversarial attack
Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. An llm can fool itself: A prompt-based adversarial attack. In arXiv, 2023
work page 2023
-
[72]
Metamath: Bootstrap your own mathematical questions for large language models
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In arXiv, 2023
work page 2023
-
[73]
Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. In arXiv, 2023
work page 2023
-
[74]
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In arXiv, 2024
work page 2024
-
[75]
Autodefense: Multi- agent llm defense against jailbreak attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi- agent llm defense against jailbreak attacks. In arXiv, 2024
work page 2024
-
[76]
Make them spill the beans! coercive knowledge extraction from (production) llms
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. Make them spill the beans! coercive knowledge extraction from (production) llms. In arXiv, 2023
work page 2023
-
[77]
Parden, can you repeat that? defending against jailbreaks via repetition
Ziyang Zhang, Qizhen Zhang, and Jakob Foerster. Parden, can you repeat that? defending against jailbreaks via repetition. In arXiv, 2024
work page 2024
-
[78]
A survey of large language models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. In arXiv, 2023
work page 2023
-
[79]
Weak-to-strong jailbreaking on large language models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. In arXiv, 2024
work page 2024
-
[80]
Judging llm-as-a-judge with mt-bench and chatbot arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In arXiv, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.