RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
hub Canonical reference
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.
hub tools
citation-role summary
citation-polarity summary
roles
background 9representative citing papers
Interviews with 28 AIG-SC creators show motivations spanning sexual exploration, creative expression, technical experimentation, and occasional production of non-consensual intimate imagery.
AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
Exclusive Unlearning makes LLMs safe by forgetting all but retained domain knowledge, protecting against jailbreaks while preserving useful responses in areas like medicine and math.
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
TRACE-RPS drops LLM attribute inference accuracy from around 50% to below 5% via fine-grained anonymization plus a two-stage rejection optimization.
OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.
Presents GradingAttack with token- and prompt-level adversarial attacks that compromise LLM educational grading agents on multiple datasets, showing prompt-level attacks succeed more while token-level are stealthier.
RSA prompting enables LLMs to automatically create functional exploits for CVEs in Odoo ERP, succeeding on all tested cases in 3-5 rounds and removing the need for manual effort.
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
LLMs exhibit identity-dependent hedging on human rights questions, with group identity as the strongest predictor among tested factors, and group steering mitigates the disparity.
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
citing papers explorer
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.