{"work":{"id":"3322fa86-1768-4677-8425-dd326b45e078","openalex_id":null,"doi":null,"arxiv_id":"2307.15043","raw_key":null,"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","authors":null,"authors_text":"Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson","year":2023,"venue":"cs.CL","abstract":"Because \"out-of-the-box\" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called \"jailbreaks\" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods.\n  Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. 
In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.","external_url":"https://arxiv.org/abs/2307.15043","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:23:28.094679+00:00","pith_arxiv_id":"2307.15043","created_at":"2026-05-09T05:55:32.396463+00:00","updated_at":"2026-06-29T13:23:28.094679+00:00","title_quality_ok":true,"display_title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","render_title":"Universal and Transferable Adversarial Attacks on Aligned Language Models"},"hub":{"state":{"work_id":"3322fa86-1768-4677-8425-dd326b45e078","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":332,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2023-08-02T16:30:40+00:00","last_pith_cited_at":"2026-06-26T01:12:02+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:38:55.563424+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":37},{"context_role":"dataset","n":6},{"context_role":"method","n":5},{"context_role":"baseline","n":2},{"context_role":"other","n":2}],"polarity_counts":[{"context_polarity":"background","n":34},{"context_polarity":"use_dataset","n":6},{"context_polarity":"unclear","n":4},{"context_polarity":"use_method","n":4},{"context_polarity":"baseline","n":2},{"context_polarity":"support","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Universal and Transferable Adversarial Attacks on Aligned Language 
Models","claims":[{"claim_text":"Because \"out-of-the-box\" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called \"jailbreaks\" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Universal and Transferable Adversarial Attacks on Aligned Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:43:40.446059+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"d5f98311-adb1-4b25-880e-4f7dd5576ee8","orcid":null,"display_name":"Andy Zou"},{"id":"1ed2dbd7-697c-4903-9f91-408cff669f49","orcid":null,"display_name":"Zifan Wang"},{"id":"e598061b-b452-4a7e-8dc3-5e1d3353d206","orcid":null,"display_name":"Nicholas Carlini"},{"id":"3cd616f6-e1d9-4f18-a2b8-118bcdb0b9a8","orcid":null,"display_name":"Milad Nasr"},{"id":"a65e0ff9-374d-4384-a837-f39e663f7541","orcid":null,"display_name":"J Zico Kolter"},{"id":"630200c3-a39e-4e97-a3db-da8b19b7dc26","orcid":null,"display_name":"Matt Fredrikson"}]},"error":null,"updated_at":"2026-05-13T21:43:33.881267+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T21:43:38.121150+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Llama Guard: LLM-based Input-Output Safeguard for Human-AI 
Conversations","work_id":"93844332-869b-448c-a1be-35466150b1b2","shared_citers":37},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":35},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":35},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":33},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":26},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":26},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":23},{"title":"Jailbreaking Black Box Large Language Models in Twenty Queries","work_id":"38678cda-6595-4ca3-916b-066c00cce063","shared_citers":23},{"title":"AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models","work_id":"3b676de6-edef-4976-a8b5-082d4ff50867","shared_citers":20},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":20},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":20},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":19},{"title":"Ignore Previous Prompt: Attack Techniques For Language Models","work_id":"a7c5b6ec-3407-4330-96c8-3fc58e7d410b","shared_citers":18},{"title":"Prompt Injection attack against LLM-integrated Applications","work_id":"977b4683-bba6-49d6-8f3d-496c41cb7fac","shared_citers":18},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons 
Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":16},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":14},{"title":"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!","work_id":"8b07137a-7175-4dd1-a8d9-570493d3f404","shared_citers":14},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":14},{"title":"Baseline defenses for adversarial attacks against aligned language models","work_id":"db5870ca-177b-4d1d-a08d-ee5ceab17fe3","shared_citers":12},{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":12},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":12},{"title":"JailbreakBench: An open robustness benchmark for jailbreaking large language models","work_id":"a8e91fcd-dc7a-457f-91b8-51f660cb3053","shared_citers":12},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":12},{"title":"A StrongREJECT for empty jailbreaks","work_id":"27281d18-31bd-4124-9fa7-4e61945ff9d1","shared_citers":11}],"time_series":[{"n":3,"year":2024},{"n":1,"year":2025},{"n":151,"year":2026}]},"error":null,"updated_at":"2026-05-13T21:43:41.104554+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T21:43:36.164792+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","claims":[{"claim_text":"Because \"out-of-the-box\" large language models are capable 
of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called \"jailbreaks\" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Universal and Transferable Adversarial Attacks on Aligned Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:43:33.301814+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","claims":[{"claim_text":"Because \"out-of-the-box\" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called \"jailbreaks\" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. 
Specifically, our approach finds a suffix that, when attached","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Universal and Transferable Adversarial Attacks on Aligned Language Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:43:40.448562+00:00"}},"summary":{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","claims":[{"claim_text":"Because \"out-of-the-box\" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called \"jailbreaks\" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. 
Specifically, our approach finds a suffix that, when attached","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Universal and Transferable Adversarial Attacks on Aligned Language Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations","work_id":"93844332-869b-448c-a1be-35466150b1b2","shared_citers":37},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":35},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":35},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":33},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":26},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":26},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":23},{"title":"Jailbreaking Black Box Large Language Models in Twenty Queries","work_id":"38678cda-6595-4ca3-916b-066c00cce063","shared_citers":23},{"title":"AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models","work_id":"3b676de6-edef-4976-a8b5-082d4ff50867","shared_citers":20},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":20},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":20},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":19},{"title":"Ignore Previous Prompt: Attack Techniques For 
Language Models","work_id":"a7c5b6ec-3407-4330-96c8-3fc58e7d410b","shared_citers":18},{"title":"Prompt Injection attack against LLM-integrated Applications","work_id":"977b4683-bba6-49d6-8f3d-496c41cb7fac","shared_citers":18},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":16},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":14},{"title":"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!","work_id":"8b07137a-7175-4dd1-a8d9-570493d3f404","shared_citers":14},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":14},{"title":"Baseline defenses for adversarial attacks against aligned language models","work_id":"db5870ca-177b-4d1d-a08d-ee5ceab17fe3","shared_citers":12},{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":12},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":12},{"title":"JailbreakBench: An open robustness benchmark for jailbreaking large language models","work_id":"a8e91fcd-dc7a-457f-91b8-51f660cb3053","shared_citers":12},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":12},{"title":"A StrongREJECT for empty jailbreaks","work_id":"27281d18-31bd-4124-9fa7-4e61945ff9d1","shared_citers":11}],"time_series":[{"n":3,"year":2024},{"n":1,"year":2025},{"n":151,"year":2026}]},"authors":[{"id":"630200c3-a39e-4e97-a3db-da8b19b7dc26","orcid":null,"display_name":"Matt Fredrikson","source":"manual","import_confidence":0.72},{"id":"d5f98311-adb1-4b25-880e-4f7dd5576ee8","orcid":null,"display_name":"Andy Zou","source":"manual","import_confidence":0.72},{"id":"a65e0ff9-374d-4384-a837-f39e663f7541","orcid":null,"display_name":"J 
Zico Kolter","source":"manual","import_confidence":0.72},{"id":"3cd616f6-e1d9-4f18-a2b8-118bcdb0b9a8","orcid":null,"display_name":"Milad Nasr","source":"manual","import_confidence":0.72},{"id":"e598061b-b452-4a7e-8dc3-5e1d3353d206","orcid":null,"display_name":"Nicholas Carlini","source":"manual","import_confidence":0.72},{"id":"1ed2dbd7-697c-4903-9f91-408cff669f49","orcid":null,"display_name":"Zifan Wang","source":"manual","import_confidence":0.72}]}}