Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Pith reviewed 2026-05-09 17:22 UTC · model grok-4.3
The pith
Adversarial self-play decouples LLM safety decisions from persona roles to resist jailbreaks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Persona-Invariant Consistency Learning, grounded in the structural separation hypothesis, applies a unilateral KL-divergence constraint between safety-focused and persona-augmented outputs. This produces a structural decoupling that keeps safety behavior intact under persona-based jailbreak attacks. The broader framework pairs this defense with Persona Lineage Evolution on the attack side, where credit propagates across related personas to explore high-risk spaces efficiently. Experiments show the resulting models achieve lower attack success rates without measurable loss in general performance.
What carries the argument
Persona-Invariant Consistency Learning (PICL), which applies a unilateral KL-divergence constraint so that safety decisions remain independent of persona context.
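The page does not specify the constraint's exact form or direction, so what follows is a minimal PyTorch sketch of one plausible reading: the persona-conditioned policy is pulled toward a detached safety-anchored reference, so gradient flows only through the persona branch. All names here are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def unilateral_kl_loss(safety_logits: torch.Tensor,
                       persona_logits: torch.Tensor) -> torch.Tensor:
    """One-directional KL(persona-conditioned || safety-anchored).

    Detaching the safety branch makes the constraint unilateral: the
    persona-augmented output is pulled toward the safety reference,
    never the reverse. Shapes: (batch, seq_len, vocab_size).
    """
    safety_logp = F.log_softmax(safety_logits.detach(), dim=-1)  # frozen target
    persona_logp = F.log_softmax(persona_logits, dim=-1)
    # KL over the vocabulary at each position, averaged over batch and sequence.
    kl = (persona_logp.exp() * (persona_logp - safety_logp)).sum(dim=-1)
    return kl.mean()
```

In training, a term like this would be added to the standard alignment loss with a weight; driving it to zero forces the persona-conditioned and safety-anchored next-token distributions to coincide.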
If this is right
- PLE efficiently maps high-risk persona spaces through lineage-based credit assignment (one possible scheme is sketched after this list).
- PICL lowers attack success rate on persona jailbreaks while leaving general task performance intact.
- The resulting alignment remains effective across a range of persona variations generated by the co-evolving attacker.
- Safety behavior stays consistent even when the model is explicitly instructed to adopt conflicting roles.
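Neither the page nor the abstract pins down PLE's credit rule; the bandit paper [37] in the reference graph suggests a UCB-style selection. A hypothetical sketch under that assumption: personas form a lineage tree, attack outcomes are backed up the tree with decay, and the next persona to mutate is chosen by an exploration-adjusted score.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonaNode:
    """A persona in the lineage tree (this whole scheme is hypothetical)."""
    description: str
    parent: Optional["PersonaNode"] = None
    credit: float = 0.0   # decayed sum of attack successes in this lineage
    trials: int = 0

def propagate_credit(node: PersonaNode, success: bool, decay: float = 0.5) -> None:
    """Back an attack outcome up the lineage with geometric decay, so
    ancestors of high-risk personas accumulate credit and get re-expanded."""
    reward = float(success)
    while node is not None:
        node.credit += reward
        node.trials += 1
        reward *= decay
        node = node.parent

def lineage_score(node: PersonaNode, total_trials: int, c: float = 1.0) -> float:
    """UCB-style score (cf. [37]): exploit lineage credit, explore rarely
    tried personas. The attacker mutates the argmax node into a child."""
    if node.trials == 0:
        return float("inf")
    return node.credit / node.trials + c * math.sqrt(
        math.log(total_trials) / node.trials)
```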
Where Pith is reading between the lines
- The same decoupling technique might be tested on other contextual signals such as tone, domain, or user history that currently modulate safety.
- Models trained this way could support safer long-form role-play applications where persona consistency is required but safety must not be overridden.
- If the separation holds, it could reduce the need for repeated safety fine-tuning each time new persona styles emerge.
Load-bearing premise
Safety decisions can be structurally decoupled from persona context through a unilateral KL-divergence constraint without reducing general capability or creating new attack surfaces.
What would settle it
A controlled test in which models trained with PICL are evaluated on a fresh set of persona-based jailbreak prompts never seen during training. The claim fails if these models exhibit attack success rates equal to or higher than standard safety-tuned baselines, or show measurable drops on unrelated capability benchmarks.
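Running that test is mechanical once a judge is fixed. A minimal sketch, where generate and judge are hypothetical stand-ins for the model under test and a harmfulness classifier (e.g., a WildGuard-style moderator [41]):

```python
from typing import Callable, Sequence

def attack_success_rate(generate: Callable[[str], str],
                        judge: Callable[[str, str], bool],
                        prompts: Sequence[str]) -> float:
    """ASR on held-out persona-jailbreak prompts: the fraction whose
    completion the judge flags as harmful. Both callables are
    hypothetical, not the paper's API."""
    flagged = sum(judge(p, generate(p)) for p in prompts)
    return flagged / len(prompts)

# Disconfirming outcome for the core claim, on never-seen prompts:
#   attack_success_rate(picl_model, judge, heldout)
#     >= attack_success_rate(baseline_model, judge, heldout)
```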
Original abstract
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Codes are available at https://github.com/JiajiaLi-1130/PIA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Persona-Invariant Alignment (PIA), an adversarial self-play framework for LLM safety against persona-based jailbreaks. It features Persona Lineage Evolution (PLE) on the attack side to explore high-risk personas via lineage-based credit propagation and Persona-Invariant Consistency Learning (PICL) on the defense side. PICL is theoretically grounded in the structural separation hypothesis and applies a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, with experiments claimed to show reduced attack success rate (ASR) while preserving general capabilities. Code is released.
Significance. If the structural separation hypothesis can be formally supported and the empirical gains hold under rigorous controls, the work would contribute a co-evolutionary defense paradigm that targets a specific, under-addressed vulnerability in current alignment techniques. The open-sourced implementation aids reproducibility.
major comments (3)
- [Abstract / PICL description] Abstract and theoretical grounding of PICL: The structural separation hypothesis is invoked to justify the unilateral KL-divergence constraint, yet no derivation is supplied showing that this term forces safety-related logits (or the optimal policy) to become independent of persona embeddings—for instance, by driving the gradient of safety outputs w.r.t. persona features to zero or by guaranteeing zero mutual information. This is load-bearing because the entire defense claim rests on the constraint achieving persona-invariant safety.
- [Experimental results] Experimental evaluation: The abstract states that PICL 'significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability,' but the description provides no quantitative ASR values, no baseline comparisons, and no ablation isolating the unilateral KL term. Without these, it is impossible to assess whether the reported gains are attributable to the proposed constraint or to other factors.
- [PLE method] PLE and self-play dynamics: The claim that lineage-based credit propagation efficiently explores high-risk persona spaces is central to the attack side, but the manuscript does not demonstrate that the generated attacks are out-of-distribution relative to the safety training data or that they expose vulnerabilities not already covered by standard red-teaming.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one concrete ASR number and a brief statement of the strongest baseline.
- [Method] Notation for the unilateral KL term and the structural separation hypothesis should be introduced with an explicit equation early in the method section.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that strengthening the theoretical derivation, providing explicit quantitative results with ablations, and validating the novelty of the generated attacks will improve the manuscript. We outline our responses and planned revisions below.
Point-by-point responses
- Referee (major comment 1, PICL's theoretical grounding): no derivation shows that the unilateral KL term makes safety outputs independent of persona embeddings.
Authors: We acknowledge that the manuscript invokes the structural separation hypothesis without an explicit derivation of how the unilateral KL term enforces independence. In the revision we will add a formal proof sketch in Section 3.2 showing that the constraint drives the gradient of the safety logits with respect to persona embeddings toward zero (via the chain rule on the KL term) and reduces mutual information between safety decisions and persona context. This addition will directly support the load-bearing claim. revision: yes
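The page quotes no equations, so here is a hedged reconstruction of the limiting-case argument such a proof sketch would presumably formalize (notation is ours: $\pi_\theta$ the trained policy, $\pi_{\mathrm{safe}}$ the safety-anchored reference, $x$ the query, $p$ the persona):

```latex
% Hedged reconstruction, not the paper's stated theorem.
\mathcal{L}_{\mathrm{KL}}
  = \mathbb{E}_{p}\!\left[ D_{\mathrm{KL}}\!\big(
      \pi_\theta(\cdot \mid x, p) \,\big\|\, \pi_{\mathrm{safe}}(\cdot \mid x)
    \big) \right],
\qquad
\mathcal{L}_{\mathrm{KL}} = 0
  \;\Longrightarrow\;
  \pi_\theta(\cdot \mid x, p) = \pi_{\mathrm{safe}}(\cdot \mid x)\;\;\forall p
  \;\Longrightarrow\;
  I(Y; p \mid x) = 0,
\quad Y \sim \pi_\theta(\cdot \mid x, p).
```

At that optimum the output distribution is constant in $p$, so its gradient with respect to persona features vanishes as well; the substantive work in the promised Section 3.2 would be the finite-penalty version of this statement.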
- Referee (major comment 2, experimental evaluation): no quantitative ASR values, no baseline comparisons, and no ablation isolating the unilateral KL term.
Authors: The full manuscript reports quantitative ASR results and baseline comparisons in Table 2 and Section 4.2. We agree, however, that an explicit ablation isolating the unilateral KL term is missing. In the revision we will (i) insert concrete ASR numbers and the strongest baseline into the abstract, (ii) add a dedicated ablation subsection (Table 3) that removes the KL term while keeping all other components fixed, and (iii) report the resulting ASR degradation to demonstrate the term's contribution. revision: yes
- Referee (major comment 3, PLE and self-play dynamics): no evidence that the generated attacks are out-of-distribution relative to the safety training data or that they expose vulnerabilities beyond standard red-teaming.
Authors: We will add two new analyses to the revised Section 4.1: (1) quantitative OOD metrics (cosine distance in embedding space and perplexity under the safety fine-tuning distribution) showing that PLE-generated personas lie outside the training support, and (2) a comparison against standard red-teaming baselines (e.g., AdvBench, GCG) demonstrating that a non-trivial fraction of PLE attacks succeed on models that already resist those baselines. These additions will substantiate the claim that lineage-based exploration uncovers new vulnerabilities. revision: yes
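Of the two proposed OOD metrics, the embedding one is simple to reproduce; a minimal NumPy sketch follows (the embedding model is an assumption; M3-Embedding [66] from the reference graph would be one natural choice, and the perplexity metric is analogous: mean negative log-likelihood of each persona under the safety-tuned model).

```python
import numpy as np

def min_cosine_distance(persona_embs: np.ndarray,
                        train_embs: np.ndarray) -> np.ndarray:
    """Distance from each PLE-generated persona embedding to its nearest
    neighbor in the safety-training support; larger values suggest the
    persona lies outside the training distribution. Rows are embeddings."""
    p = persona_embs / np.linalg.norm(persona_embs, axis=1, keepdims=True)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = p @ t.T                 # pairwise cosine similarities
    return 1.0 - sims.max(axis=1)  # 1 - max similarity = min cosine distance
```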
Circularity Check
The structural separation hypothesis assumes the persona-invariance that the unilateral KL in PICL is claimed to derive.
specific steps
- self-definitional [Abstract]: "Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks."
The structural separation hypothesis is defined as the possibility of decoupling safety decisions from persona context; the unilateral KL is then presented as the mechanism that 'enables' exactly this decoupling. The claimed theoretical result (maintaining safe behavior via decoupling) is therefore equivalent to the input hypothesis rather than derived from it: the method implements the assumption it invokes to justify itself.
full rationale
The paper's central theoretical claim rests on introducing the 'structural separation hypothesis' to justify the unilateral KL-divergence constraint in PICL, which is then said to 'enable the structural decoupling'. No independent derivation is provided showing that the KL term forces zero dependence (e.g., via gradients or mutual information) between safety logits and persona embeddings; instead the hypothesis directly encodes the desired invariance. This reduces the 'theoretical grounding' to a restatement of the target property, making the defense side of the self-play framework circular by construction. The attack side (PLE) is independent, but the load-bearing defense claim is not. This is a moderate circularity (score 6) rather than total (no self-citation chain or explicit fit-to-prediction reduction is quoted).
Axiom & Free-Parameter Ledger
axioms (1)
- structural separation hypothesis (ad hoc to this paper)
Reference graph
Works this paper leans on
[1] Emma Harvey, Allison Koenecke, and Rene F. Kizilcec. "Don’t forget the teachers": Towards an educator-centered understanding of harms from large language models in education. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–19, 2025.
[2] Wei Zhai, Nan Bai, Qing Zhao, Jianqiang Li, Fan Wang, Hongzhi Qi, Meng Jiang, Xiaoqin Wang, Bing Xiang Yang, and Guanghui Fu. MentalGLM series: Explainable large language models for mental health analysis on Chinese social media. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13599–13614, 2025.
[3] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[4] Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for LLM conversation safety: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747, 2024.
[5] Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, and Stjepan Picek. A comprehensive study of jailbreak attack versus defense for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7432–7449, 2024.
[6] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024.
[7] Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, et al. Safety at scale: A comprehensive survey of large model and agent safety. Foundations and Trends in Privacy and Security, 8(3-4):1–240, 2026.
[8] Mikayel Samvelyan, Sharath C. Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems, 37:69747–69786, 2024.
[9] Zheng Zhang, Peilin Zhao, Deheng Ye, and Hao Wang. Enhancing jailbreak attacks on LLMs via persona prompts. arXiv preprint arXiv:2507.22171, 2025.
[10] Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. A comprehensive survey in LLM(-agent) full stack safety: Data, training and deployment. arXiv preprint arXiv:2504.15585, 2025.
[11] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350, 2024.
[12] Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All languages matter: On the multilingual safety of LLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5865–5877, 2024.
[13] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
[14] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
[15] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025.
[16] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
[17] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems, 37:61065–61105, 2024.
[18] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in ChatGPT: Analyzing persona-assigned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1236–1270, 2023.
[19] Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
[20] Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, and Leon A. Gatys. PersonaTeaming: Exploring how introducing personas can improve automated AI red-teaming. arXiv preprint arXiv:2509.03728, 2025.
[21] Shanghai AI Lab. SafeWork-R1: Coevolving safety and intelligence under the AI-45° law. arXiv preprint arXiv:2507.18576, 2025.
[22] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
[23] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
[24] Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, and Viswanathan Swaminathan. Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv preprint arXiv:2311.11509, 2023.
[25] Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, and Shiyu Chang. Advancing the robustness of large language models through self-denoised smoothing. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024.
[26] Yihan Wang, Zhouxing Shi, Andrew Bai, and Cho-Jui Hsieh. Defending LLMs against jailbreaking attacks via backtranslation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16031–16046, 2024.
[27] Xinhao Song, Sufeng Duan, and Gongshen Liu. ALIS: Aligned LLM instruction security strategy for unsafe input prompt. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9124–9146, 2025.
[28] Deepak Kumar, Yousef Anees AbuHashem, and Zakir Durumeric. Watch your language: Investigating content moderation with large language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 18, pages 865–878, 2024.
[29] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178, 2023.
[30] Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, and Bo Li. RigorLLM: Resilient guardrails for large language models against undesired content. arXiv preprint arXiv:2403.13031, 2024.
[31] Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, and Minlie Huang. Defending large language models against jailbreaking attacks through goal prioritization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8865–8887, 2024.
[32] Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, and Weiran Xu. SEAS: Self-evolving adversarial safety optimization for large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23778–23786, 2025.
[33] Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, and Natasha Jaques. Chasing moving targets with online self-play reinforcement learning for safer language models. arXiv preprint arXiv:2506.07468, 2025.
[34] Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. STAIR: Improving safety alignment with introspective reasoning. arXiv preprint arXiv:2502.02384, 2025.
[35] Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, and Qiaosheng Zhang. MAGIC: A co-evolving attacker-defender adversarial game for robust LLM safety. arXiv preprint arXiv:2602.01539, 2026.
[36] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180. PMLR, 2019.
[37] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
[38] Tom Minka et al. Divergence measures and message passing, 2005.
[39] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
[40] AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[41] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems, 37:8093–8131, 2024.
[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[43] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024.
[44] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
[45] Mantas Mazeika, Dan Hendrycks, Huichen Li, Xiaojun Xu, Sidney Hough, Andy Zou, Arezoo Rajabi, Qi Yao, Zihao Wang, Jian Tian, et al. The trojan detection challenge. In NeurIPS 2022 Competition Track, pages 279–291. PMLR, 2023.
[46] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. arXiv preprint arXiv:2406.15513, 2024.
[47] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 2024.
[48] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023.
[49] Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947, 2024.
[50] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.
[51] Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. LLM self defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.
[52] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks, 2024.
[53] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
[54] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987, 2023.
[55] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. Advances in Neural Information Processing Systems, 37:47094–47165, 2024.
[56] Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
[57] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
[58] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
[59] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
[60] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[61] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[62] Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan. CharacterEval: A Chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11836–11850, 2024.
[63] Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, et al. InCharacter: Evaluating personality fidelity in role-playing agents through psychological interviews. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1840–1873, 2024.
[64] 16Personalities. 16Personalities: free personality test, type descriptions, relationship and career advice, 2026. URL https://www.16personalities.com/. Accessed: 2026-01-26.
[65] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl.
[66] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024.
[67] IFEval [57]: A dataset of approximately 500 verifiable instructions, designed to rigorously measure the instruction-following ability of fine-tuned models.
[68] AI2-ARC [58]: A collection of 7,787 grade-school science questions (Challenge and Easy sets), constructed to evaluate advanced question-answering and reasoning capabilities.
[69] GPQA-Diamond [59]: A widely adopted subset of the GPQA benchmark, consisting of 198 high-quality, expert-validated multiple-choice questions in biology, physics, and chemistry, serving as a challenging test of domain expertise and advanced reasoning abilities.
[70] MMLU [60, 61]: A massive multitask benchmark covering 57 diverse subjects across STEM, the humanities, and social sciences, widely adopted as a standard proxy for broad world knowledge and problem-solving ability.