pith. the verified trust layer for science. sign in

arxiv: 2311.03191 · v5 · pith:5FX652FTnew · submitted 2023-11-06 · 💻 cs.LG · cs.CR

DeepInception: Hypnotize Large Language Model to Be Jailbreaker

Pith reviewed 2026-05-19 10:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords jailbreakinglarge language modelssafety alignmentadversarial attackspersonificationnested scenariosrole-playingharmful content generation
0
0 comments X p. Extension
Add this Pith Number to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{5FX652FT}

Prints a linked pith:5FX652FT badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Nested role-play scenes let large language models generate harmful content and keep doing so in later exchanges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a lightweight technique that gets LLMs to imagine themselves inside a layered fictional scenario built from personification. This construction lets the model treat harmful requests as part of an inner story rather than direct instructions that violate its rules. The resulting outputs show higher rates of harmful content than earlier methods and the effect carries over into follow-up messages without new prompting. The same pattern appears across both open-source models such as Llama-2 and Llama-3 and closed-source ones such as the GPT series.

Core claim

By constructing a virtual nested scene through personification, the method enables LLMs to adaptively escape usage controls in normal scenarios, achieving leading harmfulness rates and sustaining jailbreaks in subsequent interactions on models including Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o.

What carries the argument

The virtual nested scene, a layered imaginary setting that the LLM role-plays within to bypass safety guardrails.

If this is right

  • The same self-losing weakness appears in both open-source models like Llama-2 and Llama-3 and closed-source models like GPT-3.5, GPT-4, and GPT-4o.
  • The induced contents reach higher harmfulness rates than previous jailbreak approaches.
  • Once the nested scene is established, harmful outputs continue in later interactions without repeated prompting.
  • The method requires only ordinary prompting and avoids high-cost computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current safety filters may overlook layered fictional contexts because they focus on direct harmful phrasing.
  • Training models to identify and reject instructions that ask for nested role-play could reduce this type of bypass.
  • The approach may transfer to other AI systems that allow creative or persona-based interactions.

Load-bearing premise

Large language models have exploitable personification capabilities that let them build a virtual nested scene to escape safety controls without triggering detection.

What would settle it

Applying the method to a model whose safety training explicitly flags and refuses nested fictional role-play would produce no harmful outputs or lose the effect after the first exchange.

read the original abstract

Large language models (LLMs) have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on high-cost computational extrapolations, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method to take advantage of the LLMs' personification capabilities to construct $\textit{a virtual, nested scene}$, allowing it to realize an adaptive way to escape the usage control in a normal scenario. Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, $\textit{e.g.}$, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeepInception, a lightweight prompting technique inspired by the Milgram experiment's authority influence. By leveraging LLMs' personification capabilities to build a virtual nested scene, the method aims to bypass safety guardrails adaptively. The authors report that this approach achieves leading harmfulness rates compared to prior methods and enables continuous jailbreaks in follow-up interactions across both open-source (Llama-2, Llama-3) and closed-source (GPT-3.5, GPT-4, GPT-4o) models. Code and data are made available.

Significance. If validated, the work identifies a fundamental 'self-losing' vulnerability in LLM safety alignments and offers an efficient, non-computational alternative to existing jailbreak techniques. The public release of code enhances potential for reproducibility and further research into LLM robustness.

major comments (2)
  1. [Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.
  2. [Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'self-losing' is used without a formal definition; a precise characterization of this weakness would aid interpretation.
  2. [Method] Method description: an explicit example of the nested-scene prompt or pseudocode would improve reproducibility beyond the GitHub link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and have revised the manuscript to improve clarity and completeness where the feedback identifies gaps.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the method achieves 'leading harmfulness rates' with previous counterparts cannot be assessed because the abstract (and presumably the results section) provides no quantitative metrics, baseline comparisons, or error analysis.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full results section reports harmfulness rates across models (Llama-2/3, GPT-3.5/4, GPT-4o) with comparisons to prior methods such as GCG and AutoDAN, along with variability across runs. We have revised the abstract to explicitly state representative harmfulness rates (e.g., over 80% on several models) and direct baseline comparisons. We have also added a brief note on evaluation consistency in the results section to address error analysis. revision: yes

  2. Referee: [Abstract] Abstract (continuous jailbreak paragraph): the claim of realizing a 'continuous jailbreak in subsequent interactions' may be explained by standard conversation-context retention rather than an adaptive, self-sustaining escape from safety mechanisms. Experiments must specify whether follow-up queries occur in independent sessions or within the same thread.

    Authors: We thank the referee for highlighting the need for experimental clarification. Our continuous-jailbreak experiments were performed by issuing follow-up queries within the same conversation thread after the initial DeepInception prompt, allowing the nested virtual scene to sustain the jailbroken state. To address potential confusion with simple context retention, we have added explicit description of the protocol in the revised experimental setup: all reported continuous results use the same thread, while separate ablation tests in fresh sessions confirm the method's effectiveness without relying on prior context. This distinction is now stated in both the abstract and methods. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical prompting method with no derivations or self-referential reductions

full rationale

The paper introduces DeepInception as a lightweight, Milgram-inspired prompting strategy to construct virtual nested scenes that exploit LLM personification for jailbreaking. No equations, parameters, or mathematical derivations appear in the provided text or abstract. The central claim rests on empirical harmfulness rates across Llama-2/3 and GPT models rather than any fitted input renamed as prediction or self-citation chain. The continuous-jailbreak observation is presented as an empirical outcome of retained context in interactions, without reduction to a self-defined quantity. This is a standard empirical contribution with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that LLMs can be reliably induced into nested role-play without guardrail activation, plus the empirical observation that this produces continuous harmful output.

axioms (1)
  • domain assumption LLMs possess personification capabilities that can be leveraged to construct virtual nested scenes bypassing safety controls.
    Invoked when describing the Milgram-inspired method and adaptive escape from usage control.
invented entities (1)
  • Virtual nested scene no independent evidence
    purpose: To enable adaptive jailbreak by framing harmful content inside fictional or authorized inner scenarios.
    New construct introduced in the method; no independent falsifiable evidence outside the prompting experiments is provided.

pith-pipeline@v0.9.0 · 5726 in / 1243 out tokens · 56699 ms · 2026-05-19T10:32:17.692638+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VoxSafeBench: Not Just What Is Said, but Who, How, and Where

    cs.SD 2026-04 unverdicted novelty 8.0

    VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

  2. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  3. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  4. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  5. SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

    cs.CR 2025-10 unverdicted novelty 7.0

    SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...

  6. GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

    cs.SD 2026-04 unverdicted novelty 6.0

    GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...

  7. TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

    cs.CR 2026-04 unverdicted novelty 6.0

    TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

  8. CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

    cs.CR 2026-04 unverdicted novelty 6.0

    CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

  9. From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation in Enterprise Software

    cs.SE 2025-12 unverdicted novelty 6.0

    RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.

  10. Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

    cs.CL 2025-10 unverdicted novelty 6.0

    Curtailing diversity in candidate pools for test-time scaling increases unsafe LLM outputs, as demonstrated by a reference-guided reduction protocol that evades standard safety classifiers across open and closed models.

  11. ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

    cs.CL 2025-08 unverdicted novelty 6.0

    ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.

  12. Exploring the Secondary Risks of Large Language Models

    cs.LG 2025-06 unverdicted novelty 6.0

    Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

  13. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  14. SoK: Robustness in Large Language Models against Jailbreak Attacks

    cs.CR 2026-05 accept novelty 5.0

    The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

  15. ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

    cs.CR 2025-11 unverdicted novelty 5.0

    ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

  16. SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models

    cs.CR 2025-10 unverdicted novelty 5.0

    SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.

  17. Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards

    cs.CR 2026-04 unverdicted novelty 4.0

    Jailbreaking LLMs for smart grid operations yields 33.1% overall attack success rate, with DeepInception at 63.17%, Claude 3.5 Haiku at 0%, Gemini 2.0 Flash-Lite at 55.04%, and GPT-4o mini at 44.34%.

  18. Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    cs.CR 2024-07 accept novelty 4.0

    A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 18 Pith papers

  1. [1]

    Using large language models to simulate multiple humans and replicate human subject studies

    Gati V Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In ICML, 2023

  2. [2]

    Exploring the psychology of gpt-4’s moral and legal reasoning

    Guilherme FCF Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of gpt-4’s moral and legal reasoning. In arXiv, 2023

  3. [3]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In arXiv, 2024

  4. [4]

    Many-shot jailbreaking

    Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking. In Anthropic, April, 2024

  5. [5]

    (ab) using images and sounds for indirect instruction injection in multi-modal llms

    Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. (ab) using images and sounds for indirect instruction injection in multi-modal llms. In arXiv, 2023

  6. [6]

    Constitutional ai: Harmlessness from ai feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. In arXiv, 2022

  7. [7]

    Image hijacks: Adversarial images can control generative models at runtime

    Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime. In arXiv, 2023

  8. [8]

    On the opportunities and risks of foundation models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. In arXiv, 2021

  9. [9]

    Take a look at it! rethinking how to evaluate language model jailbreak

    Hongyu Cai, Arjun Arunasalam, Leo Y Lin, Antonio Bianchi, and Z Berkay Celik. Take a look at it! rethinking how to evaluate language model jailbreak. In ACL, 2024

  10. [10]

    Are aligned neural networks adversarially aligned? In arXiv, 2023

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? In arXiv, 2023

  11. [11]

    Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt

    Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? In NeurIPS, 2023

  12. [12]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In arXiv, 2023

  13. [13]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In arXiv, 2024. 11

  14. [14]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. 2023

  15. [15]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  16. [16]

    Breaking down the defenses: A comparative survey of attacks on large language models

    Arijit Ghosh Chowdhury, Md Mofijul Islam, Vaibhav Kumar, Faysal Hossain Shezan, Vinija Jain, and Aman Chadha. Breaking down the defenses: A comparative survey of attacks on large language models. In arXiv, 2024

  17. [17]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

  18. [18]

    Safe rlhf: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In arXiv, 2023

  19. [19]

    Security and privacy challenges of large language models: A survey

    Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey. In arXiv, 2024

  20. [20]

    Jailbreaker: Automated jailbreak across multiple large language model chatbots

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jailbreaker: Automated jailbreak across multiple large language model chatbots. In arXiv, 2023

  21. [21]

    Can ai language models replace human participants? Trends in Cognitive Sciences, 2023

    Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. Can ai language models replace human participants? Trends in Cognitive Sciences, 2023

  22. [22]

    A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In arXiv, 2023

  23. [23]

    Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024

    Michael Feffer, Anusha Sinha, Zachary C Lipton, and Hoda Heidari. Red-teaming for generative ai: Silver bullet or security theater? In arXiv, 2024

  24. [24]

    Gpt-3: Its nature, scope, limits, and consequences

    Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 2020

  25. [25]

    Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. In arXiv, 2022

  26. [26]

    Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast

    Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast. In ICML, 2024

  27. [27]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In arXiv, 2022

  28. [28]

    Glass, Akash Srivastava, and Pulkit Agrawal

    Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James R. Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models. In ICLR, 2024. 12

  29. [29]

    Catastrophic jailbreak of open-source llms via exploiting generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. In ICLR, 2023

  30. [30]

    Llama guard: Llm-based input-output safeguard for human-ai conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. In arXiv, 2023

  31. [31]

    Baseline defenses for adversarial attacks against aligned language models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. In arXiv, 2023

  32. [32]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  33. [33]

    Artprompt: Ascii art-based jailbreak attacks against aligned llms

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In ACL, 2024

  34. [34]

    Scaling laws for neural language models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. In arXiv, 2020

  35. [35]

    Inference- time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023

  36. [36]

    Adversarial tuning: Defending against jailbreak attacks for llms

    Fan Liu, Zhao Xu, and Hao Liu. Adversarial tuning: Defending against jailbreak attacks for llms. In arXiv, 2024

  37. [37]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. In arXiv, 2023

  38. [38]

    Query-relevant images jailbreak large multi-modal models

    Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, and Yu Qiao. Query-relevant images jailbreak large multi-modal models. In arXiv, 2023

  39. [39]

    Jailbreaking chatgpt via prompt engineering: An empirical study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study. In arXiv, 2023

  40. [40]

    Protecting your llms with information bottleneck

    Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, and Jiang Bian. Protecting your llms with information bottleneck. In arXiv, 2024

  41. [41]

    Analyzing leakage of personally identifiable information in language models

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella- Béguelin. Analyzing leakage of personally identifiable information in language models. In arXiv, 2023

  42. [42]

    Behavioral study of obedience

    Stanley Milgram. Behavioral study of obedience. The Journal of abnormal and social psychology, 1963

  43. [43]

    Obedience to authority: An experimental view

    Stanley Milgram. Obedience to authority: An experimental view. 1974

  44. [44]

    Codegen: An open large language model for code with multi-turn program synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In arXiv, 2022

  45. [45]

    Our approach to ai safety., 2023

    OpenAI. Our approach to ai safety., 2023

  46. [46]

    Gpt-4 technical report

    R OpenAI. Gpt-4 technical report. In arXiv, 2023

  47. [47]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 13

  48. [48]

    Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal

    Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal. Teach LLMs to phish: Stealing private information from language models. In ICLR, 2024

  49. [49]

    Can sensitive information be deleted from llms? objectives for defending against extraction attacks

    Vaidehi Patil, Peter Hase, and Mohit Bansal. Can sensitive information be deleted from llms? objectives for defending against extraction attacks. In arXiv, 2023

  50. [50]

    The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In arXiv, 2023

  51. [51]

    Visual adversarial examples jailbreak large language models

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak large language models. In arXiv, 2023

  52. [52]

    Fine-tuning aligned language models compromises safety, even when users do not intend to! In arXiv, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In arXiv, 2023

  53. [53]

    Smoothllm: Defending large language models against jailbreaking attacks

    Alexander Robey, Eric Wong, Hamed Hassani, and George J Pappas. Smoothllm: Defending large language models against jailbreaking attacks. In arXiv, 2023

  54. [54]

    Evaluating the moral beliefs encoded in llms

    Nino Scherrer, Claudia Shi, Amir Feder, and David M Blei. Evaluating the moral beliefs encoded in llms. In arXiv, 2023

  55. [55]

    On the adversarial robustness of multi-modal founda- tion models

    Christian Schlarmann and Matthias Hein. On the adversarial robustness of multi-modal founda- tion models. In ICCV, 2023

  56. [56]

    Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models

    Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In arXiv, 2023

  57. [57]

    Survey of vulnerabilities in large language models revealed by adversarial attacks

    Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael Abu- Ghazaleh. Survey of vulnerabilities in large language models revealed by adversarial attacks. In arXiv, 2023

  58. [58]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In arXiv, 2023

  59. [59]

    Megatron-lm: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. In arXiv, 2019

  60. [60]

    Llama 2: Open foundation and fine-tuned chat models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. In arXiv, 2023

  61. [61]

    Tensor trust: Interpretable prompt injection attacks from an online game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, et al. Tensor trust: Interpretable prompt injection attacks from an online game. In arXiv, 2023

  62. [62]

    On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs

    Jen tse Huang, Wenxuan Wang, Eric John Li, Man Ho LAM, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. On the humanity of conversational AI: Evaluating the psychological portrayal of LLMs. In ICLR, 2024

  63. [63]

    Operationalizing a threat model for red-teaming large language models (llms)

    Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, and NhatHai Phan. Operationalizing a threat model for red-teaming large language models (llms). In arXiv, 2024

  64. [64]

    Exploring the limits of domain-adaptive training for detoxifying large-scale language models

    Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. In NeurIPS, 2022

  65. [65]

    Jailbroken: How does llm safety training fail? In NeurIPS, 2023

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? In NeurIPS, 2023. 14

  66. [66]

    Dai, and Quoc V Le

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022

  67. [67]

    Emergent abilities of large language models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. In arXiv, 2022

  68. [68]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  69. [69]

    Jailbreak and guard aligned language models with only few in-context demonstrations

    Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations. In arXiv, 2023

  70. [70]

    Defending chatgpt against jailbreak attack via self-reminders

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, 2023

  71. [71]

    An llm can fool itself: A prompt-based adversarial attack

    Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. An llm can fool itself: A prompt-based adversarial attack. In arXiv, 2023

  72. [72]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In arXiv, 2023

  73. [73]

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. In arXiv, 2023

  74. [74]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In arXiv, 2024

  75. [75]

    Autodefense: Multi- agent llm defense against jailbreak attacks

    Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi- agent llm defense against jailbreak attacks. In arXiv, 2024

  76. [76]

    Make them spill the beans! coercive knowledge extraction from (production) llms

    Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. Make them spill the beans! coercive knowledge extraction from (production) llms. In arXiv, 2023

  77. [77]

    Parden, can you repeat that? defending against jailbreaks via repetition

    Ziyang Zhang, Qizhen Zhang, and Jakob Foerster. Parden, can you repeat that? defending against jailbreaks via repetition. In arXiv, 2024

  78. [78]

    A survey of large language models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. In arXiv, 2023

  79. [79]

    Weak-to-strong jailbreaking on large language models

    Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-strong jailbreaking on large language models. In arXiv, 2024

  80. [80]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In arXiv, 2023

Showing first 80 references.