Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Mixed citation behavior. Most common role is background (65%).
abstract
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
representative citing papers
Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.
MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.
This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
citing papers explorer
-
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
-
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models
SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.
-
Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content
Log-substrate prompt injection via attacker-controlled fields enables effective attacks on LLM SOC assistants, with persona hijacks suppressing 68% of malicious logs and context manipulation reaching 96% success on summarization, reduced to 11.8% average under strongest defenses.
-
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
-
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
-
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
-
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.
-
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
-
Measuring Safety Alignment Effects in Autonomous Security Agents
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
-
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
-
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
-
Widening the Gap: Exploiting LLM Quantization via Outlier Injection
The paper introduces an outlier-injection attack that induces targeted weight collapse in LLMs under advanced quantization schemes including AWQ, GPTQ, and GGUF I-quants.
-
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
-
One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.
-
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
-
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-dependent harms.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
-
Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses
SADBench is a new benchmark that systematically tests steganography attacks with harmful image and text payloads against steganalysis defenses, revealing stable attack methods, near-perfect in-domain detection, transferability asymmetry favoring attacks, and persistent real-world threats on social media
-
On the Hardness of Junking LLMs
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
-
MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
-
The Two Boundaries: Why Behavioral AI Governance Fails Structurally
Behavioral governance of AI effects is undecidable for Turing-complete systems, so systems must achieve coterminous governance by separating computation from effects rather than adding governance afterward.
-
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based white-box attacks on harmful-behavior benchmarks.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
-
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
-
Jailbreaking Frontier Foundation Models Through Intention Deception
A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
-
Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failing on DOM injection.
-
Duality for the Adversarial Total Variation
Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning performance within 1.5 points.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking
HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.