Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, J Zico Kolter, Milad Nasr, Nicholas Carlini, Zifan Wang, Matt Fredrikson · 2023 · cs.CL · arXiv 2307.15043

Mixed citation behavior. Most common role is background (65%).

413 Pith papers citing it

Background 65% of classified citations

abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.

citation-role summary

background 37 dataset 6 method 5 baseline 2 other 2

citation-polarity summary

background 34 use dataset 6 unclear 4 use method 4 baseline 2 support 2
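
The two summary lines above give raw counts, while the headline reports a percentage. A minimal sketch of the arithmetic follows, assuming each line is a flat sequence of "label count" pairs. The helper names `parse_counts` and `shares` are hypothetical and are not part of the hub's tooling.

```python
import re

# Counts copied from the two summary lines on this page.
ROLES = "background 37 dataset 6 method 5 baseline 2 other 2"
POLARITY = "background 34 use dataset 6 unclear 4 use method 4 baseline 2 support 2"


def parse_counts(summary: str) -> dict[str, int]:
    """Split a flattened 'label count label count ...' line into {label: count}."""
    # Labels may contain spaces ("use dataset"), so match lazily up to the next number.
    pairs = re.findall(r"([a-z ]+?)\s+(\d+)", summary)
    return {label.strip(): int(count) for label, count in pairs}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's percentage of the total count."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


for name, line in (("roles", ROLES), ("polarity", POLARITY)):
    counts = parse_counts(line)
    pct = shares(counts)
    print(name, sum(counts.values()), {k: round(v, 1) for k, v in pct.items()})
```

Both lines total 52 classified citations. Background is 37 of 52 (about 71%) in the role summary and 34 of 52 (about 65%) in the polarity summary. Only the latter matches the 65% headline figure, which suggests, though the page does not state it, that the headline is computed from the polarity counts.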

authors

Andy Zou, J Zico Kolter, Milad Nasr, Nicholas Carlini, Zifan Wang, and Matt Fredrikson

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs

cs.CR · 2026-05-30 · unverdicted · novelty 8.0

Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

cs.CR · 2026-05-24 · unverdicted · novelty 8.0

MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.

Who Owns This Agent? Tracing AI Agents Back to Their Owners

cs.CR · 2026-05-15 · unverdicted · novelty 8.0

A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

cs.CR · 2026-05-11 · conditional · novelty 8.0

LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

cs.CR · 2026-04-09 · unverdicted · novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

cs.CR · 2026-04-03 · accept · novelty 8.0

Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

cs.CL · 2026-03-17 · conditional · novelty 8.0

Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.

A First Look at the Security Issues in the Model Context Protocol Ecosystem

cs.CR · 2025-10-18 · conditional · novelty 8.0

Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem

cs.CR · 2025-09-08 · unverdicted · novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

cs.CL · 2023-08-02 · conditional · novelty 8.0

XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.

citing papers explorer

Showing 50 of 413 citing papers.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs cs.CY · 2026-06-27 · unverdicted · none · ref 27 · internal anchor
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
Confused ChatGPT: Cross-App Context Poisoning via First-Party APIs cs.CR · 2026-05-30 · unverdicted · none · ref 43 · internal anchor
Identifies cross-app context poisoning in ChatGPT Apps, a persistent indirect prompt injection delivered through undocumented first-party API parameters that lets one app manipulate others via the shared untagged context.
MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning cs.CR · 2026-05-24 · unverdicted · none · ref 4 · internal anchor
MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.
Who Owns This Agent? Tracing AI Agents Back to Their Owners cs.CR · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models cs.CR · 2026-05-14 · conditional · none · ref 57 · internal anchor
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution cs.CR · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments cs.CR · 2026-05-11 · conditional · none · ref 48 · internal anchor
LITMUS is the first benchmark using semantic-physical dual verification and OS state rollback to measure behavioral jailbreaks in LLM agents, revealing that even strong models execute 40%+ of high-risk operations and exhibit execution hallucination.
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs cs.CR · 2026-04-17 · conditional · none · ref 30 · internal anchor
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 82 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 70 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain cs.CR · 2026-04-09 · unverdicted · none · ref 54 · internal anchor
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail? cs.CR · 2026-04-07 · unverdicted · full · ref 23 · internal anchor
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 53 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis cs.CR · 2026-04-03 · accept · none · ref 39 · internal anchor
Agent Skills has structural security weaknesses from missing data-instruction boundaries, single-approval persistent trust, and absent marketplace reviews that require fundamental redesign.
Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models cs.CL · 2026-03-17 · conditional · none · ref 10 · internal anchor
Re-masking committed refusal tokens plus compliance prefixes bypasses safety in diffusion language models at 74-98% success across tested models.
A First Look at the Security Issues in the Model Context Protocol Ecosystem cs.CR · 2025-10-18 · conditional · none · ref 60 · internal anchor
Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.
Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem cs.CR · 2025-09-08 · unverdicted · none · ref 57 · internal anchor
This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers revealing systemic vulnerabilities from missing isolation and least-privilege in the
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems cs.MA · 2024-10-09 · unverdicted · none · ref 98 · internal anchor
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 73 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models cs.CL · 2023-08-02 · conditional · none · ref 5 · internal anchor
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 276 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing cs.AI · 2026-06-29 · unverdicted · none · ref 52 · internal anchor
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense cs.CR · 2026-06-28 · unverdicted · none · ref 12 · internal anchor
Response-time linear probing on first generated tokens detects prefilling attacks missed by prompt-time activation defenses, achieving 0/40 attack success and 0% false positives across seven models while composing orthogonally with AlphaSteer.
Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies cs.LG · 2026-06-28 · unverdicted · none · ref 10 · internal anchor
SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models cs.CR · 2026-06-25 · unverdicted · none · ref 22 · internal anchor
Shared-embedding sequence models cannot achieve Semantic-Faithful Control over control-authoritative actions due to provenance-recovery impossibility, control-path exposure, and finite-coverage invariance gap.
Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers cs.LG · 2026-06-10 · unverdicted · none · ref 53 · internal anchor
An online KS-statistic monitor detects shifts in deployed safety classifiers with 86.6% valid detection rate, exposes conformal prediction collapse in high-dimensional embeddings, and derives a confidence-gated security boundary against adaptive attackers.
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks cs.CR · 2026-06-04 · unverdicted · none · ref 39 · internal anchor
SlotGCG uses Vulnerable Slot Score (VSS) to identify and target the most vulnerable prompt positions for adversarial token insertion, delivering 14% higher ASR than standard GCG and 42% higher against defenses.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges cs.AI · 2026-06-03 · unverdicted · none · ref 82 · internal anchor
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs cs.CL · 2026-06-03 · unverdicted · none · ref 36 · internal anchor
Fanfiction subgenres from AO3 function as universal register-based jailbreaks, raising mean attack success rate from 0.278 to 0.731 across eight aligned LLMs on HarmBench and JailbreakBench.
$\pi$Creds: Privately Inferred Credentials cs.CR · 2026-06-02 · unverdicted · none · ref 61 · internal anchor
πCreds produces privacy-preserving verifiable credentials via trusted LLM inference on authenticated data, expanding claim types to unstructured sources and formalizing SCAE and ACPP threat models.
ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation cs.CR · 2026-06-02 · unverdicted · none · ref 42 · internal anchor
ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.
RogueMerge: Robust and Unified Attacks against LLM Model Merging cs.CR · 2026-06-02 · unverdicted · none · ref 18 · internal anchor
RogueMerge is a unified attack method that jointly optimizes task vectors to succeed after merging, using stochastic min-max simulation for unknown merging settings and a Taylor-approximated DRO for prompt generalization on generative LLMs.
Gate AI: LLM Security Benchmark Evaluation Methodology and Results cs.LG · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
Introduces a cross-validation-based evaluation methodology for LLM security detectors using a global threshold and group-fold leakage checks to avoid per-dataset tuning.
MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models cs.CR · 2026-06-01 · unverdicted · none · ref 56 · internal anchor
MaskForge reaches 79.3% average attack success rate on five dLLMs by adaptively searching and accumulating structural attack patterns with a UCB bandit, improving 17.6% over baselines and transferring to 88.2% on AdvBench.
What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents cs.CR · 2026-06-01 · unverdicted · none · ref 8 · internal anchor
The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 19 · internal anchor
THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models cs.CR · 2026-05-29 · unverdicted · none · ref 15 · internal anchor
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization cs.AI · 2026-05-28 · unverdicted · none · ref 33 · internal anchor
A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
AIRGuard: Guarding Agent Actions with Runtime Authority Control cs.CR · 2026-05-27 · unverdicted · none · ref 32 · internal anchor
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions cs.LG · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models cs.CL · 2026-05-25 · unverdicted · none · ref 15 · internal anchor
SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.
Poisoning the Watchtower: Prompt Injection Attacks Against LLM-Augmented Security Operations Through Adversarial Log Content cs.CR · 2026-05-23 · unverdicted · none · ref 10 · internal anchor
Log-substrate prompt injection via attacker-controlled fields enables effective attacks on LLM SOC assistants, with persona hijacks suppressing 68% of malicious logs and context manipulation reaching 96% success on summarization, reduced to 11.8% average under strongest defenses.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 102 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 102 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs cs.CR · 2026-05-20 · conditional · none · ref 12 · internal anchor
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
Codec-Robust Attacks on Audio LLMs cs.SD · 2026-05-19 · unverdicted · none · ref 18 · 2 links · internal anchor
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025) cs.CR · 2026-05-19 · accept · none · ref 1 · internal anchor
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
