super hub Mixed citations

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, J Zico Kolter, Milad Nasr, Nicholas Carlini, Zifan Wang · 2023 · cs.CL · arXiv 2307.15043

Mixed citation behavior. Most common role is background (65%).

337 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 337 citing papers more from Andy Zou arXiv PDF

abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 37 dataset 6 method 5 baseline 2 other 2

citation-polarity summary

background 34 use dataset 6 unclear 4 use method 4 baseline 2 support 2

claims ledger

abstract Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached

authors

and Matt Fredrikson Andy Zou J Zico Kolter Milad Nasr Nicholas Carlini Zifan Wang

co-cited works

representative citing papers

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

cs.CR · 2026-04-03 · accept · novelty 8.0

Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

cs.CL · 2026-03-17 · conditional · novelty 8.0

Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.

A First Look at the Security Issues in the Model Context Protocol Ecosystem

cs.CR · 2025-10-18 · conditional · novelty 8.0

Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

cs.CR · 2025-09-08 · unverdicted · novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

cs.CL · 2023-08-02 · conditional · novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

AIRGuard: Guarding Agent Actions with Runtime Authority Control

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

citing papers explorer

Showing 50 of 337 citing papers.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs cs.CR · 2026-05-30 · unverdicted · none · ref 43 · internal anchor
Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.
Who Owns This Agent? Tracing AI Agents Back to Their Owners cs.CR · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 57 · internal anchor
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution cs.CR · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments cs.CR · 2026-05-11 · conditional · none · ref 48 · internal anchor
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs cs.CR · 2026-04-17 · conditional · none · ref 30 · internal anchor
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 82 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 70 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain cs.CR · 2026-04-09 · unverdicted · none · ref 54 · internal anchor
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail? cs.CR · 2026-04-07 · unverdicted · full · ref 23 · internal anchor
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 53 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis cs.CR · 2026-04-03 · accept · none · ref 39 · internal anchor
Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models cs.CL · 2026-03-17 · conditional · none · ref 10 · internal anchor
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
A First Look at the Security Issues in the Model Context Protocol Ecosystem cs.CR · 2025-10-18 · conditional · none · ref 60 · internal anchor
Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.
Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem cs.CR · 2025-09-08 · unverdicted · none · ref 57 · internal anchor
This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems cs.MA · 2024-10-09 · unverdicted · none · ref 98 · internal anchor
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 73 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models cs.CL · 2023-08-02 · conditional · none · ref 5 · internal anchor
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization cs.AI · 2026-05-28 · unverdicted · none · ref 33 · internal anchor
A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
AIRGuard: Guarding Agent Actions with Runtime Authority Control cs.CR · 2026-05-27 · unverdicted · none · ref 32 · internal anchor
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions cs.LG · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 102 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs cs.CR · 2026-05-20 · conditional · none · ref 12 · internal anchor
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
Codec-Robust Attacks on Audio LLMs cs.SD · 2026-05-19 · unverdicted · none · ref 18 · 2 links · internal anchor
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025) cs.CR · 2026-05-19 · accept · none · ref 1 · internal anchor
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes cs.LG · 2026-05-19 · conditional · none · ref 3 · internal anchor
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 55 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Measuring Safety Alignment Effects in Autonomous Security Agents cs.CR · 2026-05-19 · conditional · none · ref 14 · internal anchor
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing cs.LG · 2026-05-18 · unverdicted · none · ref 20 · internal anchor
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation cs.CR · 2026-05-15 · unverdicted · none · ref 85 · internal anchor
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs cs.CL · 2026-05-13 · conditional · none · ref 37 · internal anchor
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 33 · internal anchor
DisaBench supplies a participatory taxonomy of twelve disability harm types, paired benign-adversarial prompts across seven life domains, and human-annotated data showing that standard safety tests miss context-dependent harms.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 58 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts cs.AI · 2026-05-12 · conditional · none · ref 22 · internal anchor
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems cs.CR · 2026-05-12 · unverdicted · none · ref 42 · internal anchor
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection cs.CR · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck cs.CR · 2026-05-11 · unverdicted · none · ref 35 · internal anchor
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets cs.CR · 2026-05-10 · unverdicted · none · ref 66 · internal anchor
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success cs.CR · 2026-05-09 · accept · none · ref 3 · internal anchor
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off cs.CR · 2026-05-09 · unverdicted · none · ref 3 · internal anchor
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms cs.AI · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization cs.CV · 2026-05-08 · unverdicted · none · ref 33 · 2 links · internal anchor
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration cs.CR · 2026-05-08 · conditional · none · ref 57 · internal anchor
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation cs.LG · 2026-05-07 · unverdicted · none · ref 37 · internal anchor
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses cs.CR · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
SADBench is a new benchmark that systematically tests steganography attacks with harmful image and text payloads against steganalysis defenses, revealing stable attack methods, near-perfect in-domain detection, transferability asymmetry favoring attacks, and persistent real-world threats on social媒体
On the Hardness of Junking LLMs cs.LG · 2026-05-06 · unverdicted · none · ref 65 · internal anchor
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without meaningful instructions, showing natural backdoors are present yet require more effort than semantic attacks.
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking cs.CR · 2026-05-01 · unverdicted · none · ref 54 · internal anchor
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.

Universal and Transferable Adversarial Attacks on Aligned Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer