Peering Behind the Shield: Guardrail Identification in Large Language Models

Michael Backes; Rui Wen; Yang Zhang; Yixin Wu; Ziqing Yang

arXiv: 2502.01241 · v2 · submitted 2025-02-03 · 💻 cs.CR

Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang

Pith reviewed 2026-05-23 03:42 UTC · model grok-4.3

classification 💻 cs.CR

keywords guardrail identification · adversarial prompts · LLM safety · black-box agents · AP-Test · input output tests · match score

The pith

AP-Test identifies which guardrail is protecting a black-box LLM agent by sending guard-specific adversarial prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AP-Test to determine the identity of guardrails moderating LLM agents without internal access. It applies two complementary testing strategies on inputs and outputs plus a match score to distinguish guardrails even when the base model has its own safety alignment and multiple guards may be stacked. Experiments across agents and four open-source guardrails show perfect classification accuracy in tested scenarios. A sympathetic reader would care because successful identification lets an adversary craft prompts that specifically defeat the detected guard. The work treats guard identification as a practical prerequisite for designing more effective bypass attacks.

Core claim

AP-Test achieves perfect classification accuracy in identifying four open-source guardrails deployed in diverse black-box AI agents by leveraging guard-specific adversarial prompts through input and output guard tests together with a match score metric.

What carries the argument

AP-Test, which combines input guard tests, output guard tests, and a match score metric to isolate one guardrail's identity from the base LLM and other guards.
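
This page does not reproduce the paper's definition of the match score, so the following is only a hypothetical sketch of what such a decision metric might look like. It assumes each probe has an expected outcome (blocked or not blocked) under a candidate guardrail, and it scores the fraction of probes on which the agent's observed behavior agrees. The names `match_score`, `expected`, and `observed`, and the toy data, are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def match_score(expected: Sequence[bool], observed: Sequence[bool]) -> float:
    """Fraction of probes whose observed outcome agrees with the expectation.

    Hypothetical stand-in for the paper's match score, which is not defined
    on this page. True means "the agent blocked the probe".
    """
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    if not expected:
        raise ValueError("at least one probe is required")
    agreements = sum(e == o for e, o in zip(expected, observed))
    return agreements / len(expected)


# Toy data (invented): five probes crafted against one candidate guardrail,
# all expected to be blocked, and what a black-box agent actually did.
expected_under_candidate = [True, True, True, True, True]
observed_from_agent = [True, True, False, True, True]

score = match_score(expected_under_candidate, observed_from_agent)
print(f"match score: {score:.2f}")
```

In this toy setup one score would be computed per candidate guardrail, and the scores compared to reach a decision. How the paper actually combines the input and output guard tests is not specified here.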

If this is right

Adversaries gain a practical method to map the exact safety component before launching targeted attacks.
Real-world AI agents expose enough behavioral signal for guardrail identity to be recovered through black-box queries.
Both input-side and output-side tests plus the match score are required for the reported perfect accuracy.
The approach works across multiple agent types and remains effective even when the base LLM already incorporates safety training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Guardrail providers may need to introduce deliberate behavioral randomization so that identification prompts no longer produce distinguishable outputs.
The same prompt-based probing strategy could be adapted to detect other hidden components such as specific fine-tuning datasets or alignment techniques.
Once identification becomes routine, the security benefit of keeping guardrail choice secret disappears, shifting the defense burden to making individual guards harder to bypass.

Load-bearing premise

Guard-specific adversarial prompts can isolate the identity of one guardrail despite interference from the safety-aligned base LLM and any additional guardrails present.

What would settle it

Applying AP-Test to an agent using a previously unseen guardrail or guardrail combination and observing that the resulting match scores fail to produce a unique correct classification.
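
The falsification test above can be phrased as a check on whether a set of match scores yields exactly one winner. The sketch below is an assumption-laden illustration: the threshold of 0.9, the function name `identify_guardrail`, the candidate names, and all score values are invented, and the paper's actual decision rule may differ.

```python
from typing import Mapping, Optional


def identify_guardrail(scores: Mapping[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the only candidate whose score reaches the threshold, else None.

    None covers both failure modes: no candidate matches (for example an
    unseen guardrail) and several candidates match (an ambiguous result).
    """
    winners = [name for name, value in scores.items() if value >= threshold]
    return winners[0] if len(winners) == 1 else None


# Placeholder candidate names and invented scores, for illustration only.
clear_case = {"guard_a": 0.97, "guard_b": 0.12, "guard_c": 0.08}
ambiguous_case = {"guard_a": 0.95, "guard_b": 0.93, "guard_c": 0.10}
unseen_case = {"guard_a": 0.31, "guard_b": 0.27, "guard_c": 0.22}

for label, case in [("clear", clear_case), ("ambiguous", ambiguous_case), ("unseen", unseen_case)]:
    print(label, "->", identify_guardrail(case))
```

Under this reading, the claim would be undermined if a deployed guardrail or guardrail combination produced the ambiguous pattern, or a confident but wrong single winner.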

Figures

Figures reproduced from arXiv: 2502.01241 by Michael Backes, Rui Wen, Yang Zhang, Yixin Wu, Ziqing Yang.

**Figure 2.** Framework of our AP-Test. We first perform (1) adversarial prompt optimization based on the candidate guardrail with …

**Figure 3.** Workflow on real-world scenarios. We first conduct …

**Figure 4.** Influence of different weights of loss terms. The re…

Original abstract

With the rapid adoption of large language models (LLMs), conversational AI agents have become widely deployed across real-world applications. To enhance safety, these agents are often equipped with guardrails that moderate harmful content. Identifying the guardrails in an agent thus becomes critical for adversaries to understand the system and design guard-specific attacks. In this work, we introduce AP-Test, a novel approach that leverages guard-specific adversarial prompts to detect the identity of guardrails deployed in black-box AI agents. Our method addresses key challenges in this task, including the influence of safety-aligned LLMs and other guardrails, as well as a lack of principled decision-making strategies. AP-Test employs two complementary testing strategies, input and output guard tests, and a new metric, match score, to enable robust identification. Experiments across diverse agents and four open-source guardrails demonstrate that AP-Test achieves perfect classification accuracy in multiple scenarios. Ablation studies further highlight the necessity of our proposed components. Our findings reveal a practical path toward guardrail identification in real-world AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AP-Test gets perfect accuracy identifying isolated guardrails but the experiments skip the stacked and base-LLM cases the paper flags as central challenges.

The main takeaway is that AP-Test uses guard-specific adversarial prompts plus input and output test strategies and a match score to identify which guardrail is deployed on a black-box agent, and it reaches perfect classification accuracy on four open-source guardrails across the agents they tried. The ablations confirm that the new pieces contribute to the result. That is the concrete advance: a structured way to turn observable behavior into an identity decision instead of relying on manual inspection. The experiments are run on diverse agents, which gives the numbers some breadth within the tested regime. The paper also states upfront the difficulties posed by safety-aligned base models and co-deployed guardrails, so the framing is honest about the setting. The match score itself is a small but useful addition for making the decision rule explicit.

The soft spot is the mismatch between the stated challenges and the actual test setups. The reported results appear to evaluate each guardrail alone rather than in combinations or behind a strongly aligned base LLM. If the base model already refuses the prompt or a second guardrail alters the output signature, the match score has no way to isolate the target guardrail, yet those regimes are exactly what the introduction calls out. Without data on those cases the perfect accuracy stays limited to the simpler scenario.

The method is aimed at researchers doing red-teaming or security analysis of deployed LLM agents. Someone already working on guardrail attacks or system fingerprinting could take the framework and apply it directly, though they would probably need to add their own stacked-guard tests. It is worth sending for peer review because the core idea is testable, the results are reported clearly where they apply, and referees can require the harder cases to be addressed before publication.

Referee Report

2 major / 0 minor

Summary. The paper introduces AP-Test, a black-box method to identify guardrails in LLM agents via guard-specific adversarial prompts. It uses complementary input and output guard tests plus a match-score metric to address challenges from safety-aligned base LLMs and stacked guardrails. Experiments on four open-source guardrails across diverse agents are claimed to yield perfect classification accuracy, supported by ablation studies.

Significance. If the method can isolate individual guardrails despite base-model alignment and co-deployed components, the work would supply a practical empirical technique for security analysis of deployed LLM agents. The explicit handling of decision-making via match score and the ablation studies are constructive elements that could support follow-on research on guardrail robustness.

major comments (2)

[Section 4] The experiments report perfect classification accuracy, yet the setups test each guardrail in isolation rather than the multi-guardrail or heavily safety-aligned regimes explicitly identified as challenges in the introduction; this leaves the central claim that AP-Test handles interference untested.
[Abstract, Section 4] The claim of 'perfect classification accuracy in multiple scenarios' is presented without error analysis, confidence intervals, or explicit controls for confounds such as base-LLM refusals masking guardrail signatures, rendering the result impossible to evaluate for robustness.
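
To make the request for confidence intervals concrete, the sketch below computes a Wilson score interval for an observed accuracy. It is a standard statistical formula rather than anything taken from the paper, and the trial counts (4, 20, 100) are hypothetical because this page does not state how many runs were made. The function name `wilson_interval` is an illustrative choice.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical trial counts; every trial is assumed to be classified correctly.
for n in (4, 20, 100):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} correct -> interval [{low:.3f}, {high:.3f}]")
```

When every trial succeeds, the lower bound simplifies to n / (n + z²), so 20 error-free trials give a lower bound of about 0.84. A perfect score over a small number of trials is therefore compatible with a noticeably lower true accuracy, which is why the trial count matters.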

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made.

Referee: [Section 4] The experiments report perfect classification accuracy, yet the setups test each guardrail in isolation rather than the multi-guardrail or heavily safety-aligned regimes explicitly identified as challenges in the introduction; this leaves the central claim that AP-Test handles interference untested.

Authors: We agree that our experiments primarily evaluated AP-Test on individual guardrails deployed in isolation across different agents. Although the introduction discusses challenges from safety-aligned base LLMs and stacked guardrails, the reported results do not include explicit tests for multi-guardrail interference. We will revise the manuscript to clearly delineate the experimental scope, acknowledge this as a limitation, and outline future work on more complex deployments.

revision: yes

Referee: [Abstract, Section 4] The claim of 'perfect classification accuracy in multiple scenarios' is presented without error analysis, confidence intervals, or explicit controls for confounds such as base-LLM refusals masking guardrail signatures, rendering the result impossible to evaluate for robustness.

Authors: We acknowledge the absence of statistical error analysis and confidence intervals in the presentation of our perfect accuracy results. The experiments involved a fixed set of adversarial prompts for each guardrail, yielding consistent identification via the match score. We will update the abstract and Section 4 to include the number of trials conducted, any observed variability, and additional details on how the input and output tests control for base model refusals to better demonstrate robustness.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical identification method with experimental validation

The paper introduces AP-Test as an empirical procedure that crafts guard-specific adversarial prompts, applies input/output tests, and computes a match score to classify deployed guardrails. No equations, derivations, or parameter-fitting steps appear in the described method or abstract. The central claim rests on reported classification accuracies from experiments on four open-source guardrails rather than on any self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain. The skeptic's concern about multi-guardrail regimes is about experimental coverage and assumption validity, not circularity in the derivation logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; ledger left minimal.


[1] [1]

1, 2, 5, 11

https://openai.com/index/hello-gpt-4o/. 1, 2, 5, 11

work page

[2] [2]

https://claude.ai/. 1

work page

[3] [3]

https://chat.deepseek.com/. 1

work page

[4] [4]

https://chat.openai.com/chat. 1, 6

work page

[5] [5]

1, 2, 5, 11

https://www.perspectiveapi.com. 1, 2, 5, 11

work page

[6] [6]

https://policies.google.com/terms/generative- ai/use-policy?hl=en. 2

work page

[7] [7]

https://ai.meta.com/llama/use-policy/. 2

work page

[8] [8]

https://openai.com/policies/usage-policies. 2

work page

[9] [9]

https://aws.amazon.com/cn/machine-learning/ responsible-ai/policy/. 2

work page

[10] [10]

https://platform.openai.com/docs/guides/ moderation/overview. 2

work page

[11] [11]

3, 5, 11

https://www.llama.com/docs/model-cards-and- prompt-formats/meta-llama-guard-2/ . 3, 5, 11

work page

[12] [12]

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations. CoRR abs/2411.10414,

work page arXiv

[13] [13]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agent- Dojo: A Dynamic Environment to Evaluate Attacks and De- fenses for LLM Agents. CoRR abs/2406.13352, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Fin- gerprinting Fine-tuned Language Models in the Wild

Nirav Diwan, Tanmoy Chakraborty, and Zubair Shafiq. Fin- gerprinting Fine-tuned Language Models in the Wild. In An- nual Meeting of the Association for Computational Linguis- tics and International Joint Conference on Natural Language Processing (ACL/IJCNLP), pages 4652–4664. ACL, 2021. 3

work page 2021

[15] [15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Betha...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts. CoRR abs/2404.05993, 2024. 3, 5, 11

work page arXiv 2024

[17] [17]

Fig- 8 Step: Jailbreaking Large Vision-language Models via Typo- graphic Visual Prompts

Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tian- shuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. Fig- 8 Step: Jailbreaking Large Vision-language Models via Typo- graphic Visual Prompts. CoRR abs/2311.05608, 2023. 2

work page arXiv 2023

[18] [18]

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification. In Annual Meeting of the Association for Computational Linguistics (ACL) . ACL,

work page

[19] [19]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. CoRR abs/2406.18495, 2024. 1, 2, 3, 5, 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversa- tions. CoRR abs/2312.06674, 2023. 1, 2, 3, 5, 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

BeaverTails: Towards Improved Safety Align- ment of LLM via a Human-Preference Dataset

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards Improved Safety Align- ment of LLM via a Human-Preference Dataset. In An- nual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023. 2

work page 2023

[22] [22]

Thomas Hou

Heng Jin, Chaoyu Zhang, Shanghao Shi, Wenjing Lou, and Y . Thomas Hou. ProFLingo: A Fingerprinting-based Copy- right Protection Scheme for Large Language Models. CoRR abs/2405.02466, 2024. 3, 4, 5

work page arXiv 2024

[23] [23]

Multi-step Jailbreaking Privacy Attacks on ChatGPT

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 2

work page arXiv 2023

[24] [24]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Au- toDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451 , 2023. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In USENIX Security Symposium (USENIX Security). USENIX, 2024. 1, 3

work page 2024

[26] [26]

Safety Alignment for Vision Language Models

Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. Safety Alignment for Vision Language Models. CoRR abs/2405.13581, 2024. 2

work page arXiv 2024

[27] [27]

HateX- plain: A Benchmark Dataset for Explainable Hate Speech De- tection

Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. HateX- plain: A Benchmark Dataset for Explainable Hate Speech De- tection. In AAAI Conference on Artificial Intelligence (AAAI), pages 14867–14875. AAAI, 2021. 4

work page 2021

[28] [28]

Your Large Language Models Are Leaving Fingerprints

Hope McGovern, Rickard Stureborg, Yoshi Suhara, and Dim- itris Alikaniotis. Your Large Language Models Are Leaving Fingerprints. CoRR abs/2405.14057, 2024. 3

work page arXiv 2024

[29] [29]

Tree of Attacks: Jailbreaking Black-Box LLMs Automati- cally

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automati- cally. CoRR abs/2312.02119, 2023. 2

work page arXiv 2023

[30] [30]

Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Riv- ière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak ...

[31]

An LLM-Driven Chatbot in Higher Education for Databases and Information Systems

Alexander Neumann, Yue Yin, Sulayman K Sowe, Stefan Decker, and Matthias Jarke. An LLM-Driven Chatbot in Higher Education for Databases and Information Systems. IEEE Transactions on Education, 2024. 1

[32]

A study of generative large language model for medical ...

Cheng Peng, Xi Yang, Aokun Chen, Kaleb E. Smith, Nima M. Pournejatian, Anthony B. Costa, Cheryl Martin, Mona G. Flores, Ying Zhang, Tanja Magoc, Gloria P. Lipori, Duane A. Mitchell, Naykky Singh Ospina, Mustafa M. Ahmed, William R. Hogan, Elizabeth A. Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. A study of generative large language model for medical ...

[33]

A Survey on Hate Speech Detection using Natural Language Processing

Anna Schmidt and Michael Wiegand. A Survey on Hate Speech Detection using Natural Language Processing. In Workshop on Natural Language Processing for Social Media (SocialNLP), pages 1–10. ACL, 2017. 4

[34]

Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024. 1, 2, 3

[35]

HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

Xinyue Shen, Yixin Wu, Yiting Qu, Michael Backes, Savvas Zannettou, and Yang Zhang. HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns. In USENIX Security Symposium (USENIX Security). USENIX, 2025. 1

[36]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971, 2023. 1

[37]

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection

Bertie Vidgen, Tristan Thrush, Zeerak Waseem, and Douwe Kiela. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. In Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL/IJCNLP), pages 1667–1682. ACL, 2021. 1, 4, 5

[38]

Membership Inference Attacks Against In-Context Learning

Rui Wen, Zheng Li, Michael Backes, and Yang Zhang. Membership Inference Attacks Against In-Context Learning. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024. 2

[39]

Assessing Prompt Injection Risks in 200+ Custom GPTs

Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, and Xinyu Xing. Assessing Prompt Injection Risks in 200+ Custom GPTs. CoRR abs/2311.11538, 2023. 1, 3

[40]

HuRef: HUman-REadable Fingerprint for Large Language Models

Boyi Zeng, Chenghu Zhou, Xinbing Wang, and Zhouhan Lin. HuRef: HUman-REadable Fingerprint for Large Language Models. CoRR abs/2312.04828, 2023. 3

[41]

ShieldGemma: Generative AI Content Moderation Based on Gemma

Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, and Oscar Wahltinez. ShieldGemma: Generative AI Content Moderation Based on Gemma. CoRR abs/2407.21772, 2024. 3, 5, 11

[42]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. CoRR abs/2403.02691, 2024. 1, 3

[43]

Instruction Backdoor Attacks Against Customized LLMs

Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, and Yang Zhang. Instruction Backdoor Attacks Against Customized LLMs. In USENIX Security Symposium (USENIX Security). USENIX, 2024. 2

[44]

Weak-to-Strong Jailbreaking on Large Language Models

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-Strong Jailbreaking on Large Language Models. CoRR abs/2401.17256, 2024. 1, 2, 3

[45]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. CoRR abs/2307.15043, 2023. 1, 2, 3

A Experimental Settings

Table 7 shows the details of the models used in our experiments, including the versions we used. Table 8 illustrates the five candidate prompt templates used in t...
