hub

Autodan: inter- pretable gradient-based adversarial attacks on large lan- guage models.arXiv preprint arXiv:2310.15140

· 2023 · arXiv 2310.15140

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 1

citation-polarity summary

background 3 unclear 1

representative citing papers

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

cs.CR · 2026-04-08 · unverdicted · novelty 7.0

SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

cs.CR · 2026-04-09 · unverdicted · novelty 6.0

TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.

Exploring the Secondary Risks of Large Language Models

cs.LG · 2025-06-14 · unverdicted · novelty 6.0

Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

cs.CR · 2025-04-29 · unverdicted · novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

Whispers in the Machine: Confidentiality in Agentic Systems

cs.CR · 2024-02-10 · unverdicted · novelty 6.0

Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

cs.CL · 2025-08-28 · unverdicted · novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

cs.LG · 2024-10-20 · unverdicted · novelty 5.0

Faster-GCG improves GCG efficiency 8x via regularization, temperature sampling, and duplicate avoidance, reaching 78.1% success rate with 32K evaluations across five aligned LLMs.

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

cs.AI · 2025-10-27 · unverdicted · novelty 4.0

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.

Exploiting Web Search Tools of AI Agents for Data Exfiltration

cs.CR · 2025-10-10 · unverdicted · novelty 4.0

Indirect prompt injection attacks remain effective on LLMs using web search tools, allowing data exfiltration and exposing ongoing weaknesses in current model defenses.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

citing papers explorer

Showing 16 of 16 citing papers.

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents cs.LG · 2026-05-09 · unverdicted · none · ref 20
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy across multiple agent types and models.
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring cs.CR · 2026-05-09 · unverdicted · none · ref 29
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 17
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems cs.CR · 2026-04-08 · unverdicted · none · ref 20
SkillTrojan demonstrates that backdoors can be placed in composable skills of agent systems to achieve up to 97% attack success rate with only minor loss in clean-task accuracy.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment cs.AI · 2026-05-03 · unverdicted · none · ref 14
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams cs.LG · 2026-04-20 · unverdicted · none · ref 14 · 2 links
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense cs.CR · 2026-04-09 · unverdicted · none · ref 48
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 55
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction cs.CR · 2025-04-29 · unverdicted · none · ref 50
The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.
A StrongREJECT for Empty Jailbreaks cs.LG · 2024-02-15 · conditional · none · ref 43
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
Whispers in the Machine: Confidentiality in Agentic Systems cs.CR · 2024-02-10 · unverdicted · none · ref 10
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL · 2025-08-28 · unverdicted · none · ref 25
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models cs.LG · 2024-10-20 · unverdicted · none · ref 12
Faster-GCG improves GCG efficiency 8x via regularization, temperature sampling, and duplicate avoidance, reaching 78.1% success rate with 32K evaluations across five aligned LLMs.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 67
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
Exploiting Web Search Tools of AI Agents for Data Exfiltration cs.CR · 2025-10-10 · unverdicted · none · ref 7
Indirect prompt injection attacks remain effective on LLMs using web search tools, allowing data exfiltration and exposing ongoing weaknesses in current model defenses.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey cs.CR · 2024-07-05 · accept · none · ref 124
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

Autodan: inter- pretable gradient-based adversarial attacks on large lan- guage models.arXiv preprint arXiv:2310.15140

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer