pith. machine review for the scientific record. sign in

arxiv: 2511.02356 · v2 · submitted 2025-11-04 · 💻 cs.CR · cs.LG

ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

Pith reviewed 2026-05-18 01:50 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords jailbreaking LLMsautomated strategy discoverystrategy libraryclosed-loop mechanismblack-box attacksLLM vulnerabilitiesattack evolution
0
0 comments X

The pith

ASTRA automates the discovery and evolution of reusable jailbreak strategies for LLMs via a closed-loop process and three-tier library.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ASTRA to overcome the limitation that most jailbreak methods produce isolated attacks without learning from results. It runs a repeating cycle where each generated prompt is tested, then distilled into reusable strategies that get stored and retrieved later. These strategies are sorted into Effective, Promising, or Ineffective tiers so the system can favor what has worked and skip what has not. The result is claimed to produce more diverse and successful attacks over time while using fewer wasted attempts. Experiments in the black-box setting show higher success rates than prior methods.

Core claim

ASTRA operates on a closed-loop 'attack-evaluate-distill-reuse' mechanism, which not only generates attack prompts but also automatically distills reusable strategies from every interaction. To systematically manage these strategies, we introduce a dynamic three-tier strategy library (Effective, Promising, and Ineffective) that categorizes strategies based on performance. This hierarchical memory mechanism enables the framework to enhance efficiency by leveraging successful patterns while optimizing the exploration space by avoiding known failures.

What carries the argument

The dynamic three-tier strategy library that stores and retrieves distilled attack strategies according to observed performance.

If this is right

  • Attack success improves over repeated interactions by reusing distilled patterns instead of starting from scratch each time.
  • The three-tier organization reduces wasted effort on known failures while preserving space for new exploration.
  • Black-box experiments show consistent outperformance against existing jailbreak baselines across tested models.
  • The closed-loop cycle produces a growing set of adaptable strategies without requiring human curation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that maintaining an explicit memory of past attack outcomes could become a standard component in automated red-teaming systems.
  • If the library remains effective across model updates, it implies that safety alignments may erode gradually through accumulated adversarial knowledge rather than isolated prompts.
  • The same distillation and tiering structure could be repurposed to automate discovery of defensive patches or alignment improvements.

Load-bearing premise

Strategies distilled from past interactions remain reusable and improve performance on new models and prompts without the library becoming dominated by narrow or overfitted patterns.

What would settle it

Measure whether success rates on fresh prompts and models continue to rise after many interactions when the library is active, versus staying flat when the library is disabled or cleared.

Figures

Figures reproduced from arXiv: 2511.02356 by Guisheng Fan, Hengrun Zhang, Huiqun Yu, Kan Ling, Xu Liu, Yan Chen, Yichi Zhu.

Figure 1
Figure 1. Figure 1: Comparison of attack pipelines. While previous [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our ASTRA framework. ASTRA is an automated jailbreaking framework that operates in a closed loop to [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evolution of the strategy libraries in ASTRA. As the [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Despite extensive safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. However, existing methods generally lack the capability for continuous learning and self-evolution from interactions, limiting the diversity and adaptability of attack strategies. To address this, we propose ASTRA, an automated framework capable of autonomously discovering, retrieving, and evolving attack strategies. ASTRA operates on a closed-loop ``attack-evaluate-distill-reuse'' mechanism, which not only generates attack prompts but also automatically distills reusable strategies from every interaction. To systematically manage these strategies, we introduce a dynamic three-tier strategy library (Effective, Promising, and Ineffective) that categorizes strategies based on performance. This hierarchical memory mechanism enables the framework to enhance efficiency by leveraging successful patterns while optimizing the exploration space by avoiding known failures. Extensive experiments in a black-box setting demonstrate that ASTRA significantly outperforms existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ASTRA, an automated framework for discovering, retrieving, and evolving jailbreak attack strategies against LLMs. It centers on a closed-loop 'attack-evaluate-distill-reuse' mechanism that distills reusable strategies from every interaction and manages them via a dynamic three-tier strategy library (Effective, Promising, Ineffective) to balance exploitation of successful patterns with avoidance of known failures. The central claim is that this enables continuous learning and self-evolution, with extensive black-box experiments demonstrating significant outperformance over existing baselines.

Significance. If the empirical results hold and the distilled strategies prove generalizable rather than overfitted, the work could meaningfully advance automated red-teaming of LLMs by moving beyond static attack methods toward adaptive, interaction-driven strategy evolution. The hierarchical library offers a concrete mechanism for managing exploration-exploitation trade-offs in strategy space, which would be a useful engineering contribution if supported by evidence of sustained performance gains across models and prompts.

major comments (2)
  1. [§3.3] §3.3 (Strategy Library and Evolution): The manuscript introduces the three-tier library and closed-loop distillation but provides no quantitative evidence on library size growth over interactions, metrics of strategy generality across prompts/models, or ablation studies isolating the contribution of the evolution/reuse component. This directly bears on the central claim that the framework improves performance on new models and prompts without the library becoming dominated by narrow or interaction-specific patterns.
  2. [§4] §4 (Experiments): The abstract and experimental narrative assert that 'extensive experiments in a black-box setting demonstrate that ASTRA significantly outperforms existing baselines,' yet supply no concrete metrics, baseline descriptions, model list, prompt distributions, or statistical details. Without these, it is impossible to evaluate whether the data support the outperformance claim or to assess robustness against the reusability concern.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing the scale of the experiments (e.g., number of models, total queries, or key metrics) to give readers immediate context for the performance claims.
  2. [§3.2] Notation for the three-tier library (e.g., how 'Promising' strategies are promoted or demoted) could be formalized with a brief pseudocode or state-transition diagram for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We value the emphasis on providing stronger quantitative support for the strategy library's dynamics and clearer experimental reporting. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Strategy Library and Evolution): The manuscript introduces the three-tier library and closed-loop distillation but provides no quantitative evidence on library size growth over interactions, metrics of strategy generality across prompts/models, or ablation studies isolating the contribution of the evolution/reuse component. This directly bears on the central claim that the framework improves performance on new models and prompts without the library becoming dominated by narrow or interaction-specific patterns.

    Authors: We agree that additional quantitative analysis is needed to substantiate the claims regarding continuous learning and reusability. In the revised version, we will expand Section 3.3 with a new figure tracking library size (number of strategies per tier) across interaction rounds, metrics such as cross-model transfer success rates for distilled strategies, and an ablation study that disables the distillation/reuse loop while keeping other components fixed. These additions will directly address concerns about overfitting to narrow patterns. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and experimental narrative assert that 'extensive experiments in a black-box setting demonstrate that ASTRA significantly outperforms existing baselines,' yet supply no concrete metrics, baseline descriptions, model list, prompt distributions, or statistical details. Without these, it is impossible to evaluate whether the data support the outperformance claim or to assess robustness against the reusability concern.

    Authors: We acknowledge that the experimental details should be stated more explicitly in the main narrative for accessibility. The manuscript already reports concrete Attack Success Rates, baseline comparisons (including PAIR, TAP, and GCG), model lists (GPT-3.5/4, Llama-2/3, Claude), and AdvBench prompt distributions in Section 4 and associated tables/figures. We will revise the section to integrate these metrics, baseline descriptions, and statistical details (standard deviations and significance tests) directly into the text rather than relying primarily on tables, improving clarity without altering the underlying results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical engineering framework without derivation chain or self-referential predictions

full rationale

The paper presents ASTRA as an automated framework for strategy discovery, retrieval, and evolution in LLM jailbreaking, relying on a closed-loop 'attack-evaluate-distill-reuse' mechanism and a three-tier strategy library. No equations, fitted parameters, predictions, or first-principles derivations are described in the provided text. The central claims rest on experimental outperformance against baselines in a black-box setting, which constitutes external empirical validation rather than any reduction of results to the framework's own inputs by construction. This is a standard self-contained engineering contribution with no load-bearing self-citations or ansatzes that would trigger circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on the domain assumption that LLMs remain jailbreakable and that interaction traces contain reusable strategy patterns; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption LLMs remain vulnerable to jailbreak attacks despite safety alignment
    Opening premise of the abstract that motivates the entire framework.
invented entities (1)
  • Dynamic three-tier strategy library (Effective, Promising, Ineffective) no independent evidence
    purpose: Categorize and manage distilled attack strategies for reuse and exploration control
    New hierarchical memory structure introduced to enable the closed-loop evolution.

pith-pipeline@v0.9.0 · 5703 in / 1219 out tokens · 27049 ms · 2026-05-18T01:50:02.911933+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 18 internal anchors

  1. [1]

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv:2404.02151

  2. [2]

    Cem Anil, Esin DURMUS, Nina Rimsky, et al. 2024. Many-shot Jailbreaking. In Proceedings of the 38th International Conference on Neural Information Processing Systems. 129696–129742

  3. [3]

    Anthropic. 2025. Anthropic Official Website. https://www.anthropic.com/. Ac- cessed: 2025-9-26

  4. [4]

    Yuntao Bai, Andy Jones, Kamal Ndousse, et al . 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862

  5. [5]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073

  6. [6]

    Emergent autonomous scientific research capabilities of large language models

    Daniil A. Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent Autonomous Scientific Research Capabilities of Large Language Models. arXiv:2304.05332

  7. [7]

    Nitay Calderon, Roi Reichart, and Rotem Dror. 2025. The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annota- tors with LLMs. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. 16051–16081

  8. [8]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2023. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419

  9. [9]

    Gelei Deng, Yi Liu, Yuekang Li, et al. 2024. MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots. InProceedings of the Network and Distributed System Security Symposium. 1–16

  10. [10]

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilin- gual Jailbreak Challenges in Large Language Models. InProceedings of the 12th International Conference on Learning Representations. https://openreview.net/ forum?id=vESNKdEMGp

  11. [11]

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. 2023. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. arXiv:2311.08268

  12. [12]

    Deep Ganguli, Liane Lovitt, Jackson Kernion, et al. 2022. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858

  13. [13]

    Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. 2025. Attacking Large Language Models with Projected Gradient Descent. arXiv:2402.09154

  14. [14]

    Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, and Yanjun Qi. 2025. TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreak- ing Large Language Models in Practice. arXiv:2502.18504

  15. [15]

    Google DeepMind. 2025. Gemini 2.0 Flash. https://cloud.google.com/vertex- ai/generative-ai/docs/models/gemini/2-0-flash. Accessed: 2025-9-26

  16. [16]

    Google DeepMind. 2025. Gemini 2.5 Flash Preview. https://cloud.google.com/ vertex-ai/generative-ai/docs/models/gemini/2-5-flash. Accessed: 2025-9-26

  17. [17]

    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. 2024. Cold- attack: Jailbreaking llms with stealthiness and controllability. arXiv:2402.08679

  18. [18]

    Wang, Nika Haghtalab, and Jacob Steinhardt

    Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, and Jacob Steinhardt. 2024. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation. arXiv:2406.20053

  19. [19]

    Brian R. Y. Huang, Maximilian Li, and Leonard Tang. 2025. Endless Jailbreaks with Bijection Learning. arXiv:2410.01294

  20. [20]

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, et al. 2024. ArtPrompt: ASCII Art- based Jailbreak Attacks against Aligned LLMs. arXiv:2402.11753

  21. [21]

    Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, and Sub- habrata Mukherjee. 2025. RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking. arXiv:2409.17458

  22. [22]

    Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, and Haohan Wang. 2025. GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models. arXiv:2402.03299

  23. [23]

    Menke, and Haohan Wang

    Haibo Jin, Andy Zhou, Joe D. Menke, and Haohan Wang. 2024. Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters. arXiv:2405.20413

  24. [24]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. 2023. Efficient Memory Man- agement for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. 611–626

  25. [25]

    Nathaniel Li, Ziwen Han, Ian Steneker, et al. 2024. LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet. arXiv:2408.15221

  26. [26]

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. 2024. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. arXiv:2402.16914

  27. [27]

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. Deepinception: Hypnotize large language model to be jailbreaker. arXiv:2311.03191

  28. [28]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. InProceedings of the 12th International Conference on Learning Representations. https://openreview.net/forum?id=7Jwpw4qKkb

  29. [29]

    Huijie Lv, Xiao Wang, Yuansen Zhang, et al . 2024. CodeChameleon: Per- sonalized Encryption Framework for Jailbreaking Large Language Models. arXiv:2402.16717

  30. [30]

    Mantas Mazeika, Long Phan, Xuwang Yin, et al. 2024. HarmBench: A Standard- ized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249

  31. [31]

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv:2312.02119 https://arxiv.org/abs/2312. 02119

  32. [32]

    Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2025-9-26

  33. [33]

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, et al . 2022. WebGPT: Browser- assisted question-answering with human feedback. arXiv:2112.09332

  34. [34]

    OpenAI. 2024. GPT-4o System Card. https://openai.com/index/gpt-4o-system- card/. Accessed: 2025-9-26

  35. [35]

    OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2025-9-26

  36. [36]

    Long Ouyang, Jeff Wu, Xu Jiang, et al. 2022. Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155

  37. [37]

    Ethan Perez, Saffron Huang, Francis Song, et al. 2022. Red Teaming Language Models with Language Models. arXiv:2202.03286

  38. [38]

    Xiangyu Qi, Yi Zeng, Tinghao Xie, et al. 2023. Fine-tuning Aligned Language Mod- els Compromises Safety, Even When Users Do Not Intend To! arXiv:2310.03693 Conference’17, July 2017, Washington, DC, USA Xu Liu et al

  39. [39]

    Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Wai Lam, and Lizhuang Ma. 2024. CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion. InFindings of the Association for Compu- tational Linguistics. 11437–11452

  40. [40]

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv:2404.01833

  41. [41]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, et al . 2023. Toolformer: Lan- guage Models Can Teach Themselves to Use Tools. InProceedings of the 37th International Conference on Neural Information Processing Systems. 68539–68551

  42. [42]

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang

  43. [43]

    "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv:2308.03825

  44. [44]

    Li Siyan, Vethavikashini Chithrra Raghuram, Omar Khattab, Julia Hirschberg, and Zhou Yu. 2025. PAPILLON: Privacy Preservation from Internet-based and Local Language Model Ensembles. arXiv:2410.17127

  45. [45]

    Zhiqing Sun, Yikang Shen, Qinhong Zhou, et al . 2023. Principle-Driven Self- Alignment of Language Models from Scratch with Minimal Human Supervision. InProceedings of the 37th International Conference on Neural Information Processing Systems. 2511–2565

  46. [46]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail?. InProceedings of the 37th International Conference on Neural Information Processing Systems. 80079–80110

  47. [47]

    Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, and Yisen Wang. 2024. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. arXiv:2310.06387

  48. [48]

    Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, and Muhao Chen

  49. [49]

    arXiv:2311.09827

    Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking. arXiv:2311.09827

  50. [50]

    An Yang, Anfeng Li, Baosong Yang, et al . 2025. Qwen3 Technical Report. arXiv:2505.09388

  51. [51]

    Dongyu Yao, Jianshu Zhang, Ian G Harris, and Marcel Carlsson. 2023. FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models. arXiv:2309.05274

  52. [52]

    Jingwei Yi, Rui Ye, Qisi Chen, et al. 2024. On the Vulnerability of Safety Alignment in Open-Access LLMs. InFindings of the Association for Computational Linguistics. 9236–9260

  53. [53]

    Sibo Yi, Yule Liu, Zhen Sun, et al. 2024. Jailbreak Attacks and Defenses Against Large Language Models: A Survey. arXiv:2407.04295

  54. [54]

    Zheng Xin Yong, Cristina Menghini, and Stephen Bach. 2023. Low-resource languages jailbreak gpt-4. arXiv:2310.02446

  55. [55]

    Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak rompts. arXiv:2309.10253

  56. [56]

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Pinjia He, Shum- ing Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. InProceedings of the 12th International Conference on Learning Representations. https://openreview.net/forum?id=MbfAK4s61A

  57. [57]

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi

  58. [58]

    arXiv:2401.06373

    How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. arXiv:2401.06373

  59. [59]

    Yiran Zhao, Wenyue Zheng, Tianle Cai, et al . 2024. Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling. arXiv:2403.01251

  60. [60]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

  61. [61]

    Zhipu AI. 2025. GLM-4.5. https://huggingface.co/zai-org/GLM-4.5. Hugging Face Model Hub. Accessed: 2025-9-26

  62. [62]

    Andy Zou, Long Phan, Sarah Chen, et al. 2025. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405

  63. [63]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 A Ethical Considerations Our research is defensive in nature, aimed at providing critical insights for building more robust security defense systems by sys- tematically revealing the underlying vulner...