Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang · 2023 · cs.LG · arXiv 2309.00614

Canonical reference. 48 Pith papers cite this work, and 75% of the classified citations (9 of 12) cite it as background.

abstract

As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLMs domain than it has been in computer vision.
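
The perplexity-based detection idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the language model for a toy add-one-smoothed word unigram model so that it runs with the standard library alone, and the reference text, the example prompts, the function names, and the threshold of 15.0 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from typing import Tuple


def train_unigram(corpus: str) -> Tuple[Counter, int]:
    """Count whitespace-separated tokens in a reference text (toy stand-in for an LLM)."""
    tokens = corpus.lower().split()
    return Counter(tokens), len(tokens)


def perplexity(text: str, counts: Counter, total: int) -> float:
    """Perplexity of `text` under the add-one-smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("cannot score an empty prompt")
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen tokens
    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing: unseen tokens get a small but nonzero probability.
        log_prob += math.log((counts[tok] + 1) / (total + vocab_size))
    return math.exp(-log_prob / len(tokens))


def is_flagged(text: str, counts: Counter, total: int, threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(text, counts, total) > threshold


# Hypothetical reference text standing in for "natural" prompts.
REFERENCE = (
    "how do i bake a cake how do i write a letter "
    "tell me how to bake bread tell me a story"
)
counts, total = train_unigram(REFERENCE)

benign = "tell me how to bake a cake"
# Made-up gibberish suffix, loosely imitating an optimizer-produced string.
suspicious = benign + " zq}]( ;!-- xkcd~~ ]]>> qwv##"

THRESHOLD = 15.0  # assumed value; in practice it would be calibrated on benign prompts

for prompt in (benign, suspicious):
    print(f"{perplexity(prompt, counts, total):6.2f}  flagged={is_flagged(prompt, counts, total, THRESHOLD)}")
```

In this toy setting the prompt with the gibberish suffix scores a higher perplexity than the benign prompt and is the only one flagged. A real filter would score token log-likelihoods under a language model, and the threshold governs the robustness-performance trade-off the abstract refers to: a lower value catches more attacks but rejects more benign prompts.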

citation-role summary

background: 9 · baseline: 2 · method: 1

citation-polarity summary

background: 9 · baseline: 2 · use method: 1

representative citing papers

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

cs.CR · 2026-05-28 · unverdicted · novelty 7.0

MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

cs.CR · 2026-05-20 · conditional · novelty 7.0

Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

cs.LG · 2026-05-19 · conditional · novelty 7.0

CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.

BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

cs.AI · 2026-05-12 · conditional · novelty 7.0

BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

cs.CR · 2026-05-08 · conditional · novelty 7.0

A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.

Attention Is Where You Attack

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Prompt Injection Attack to Tool Selection in LLM Agents

cs.CR · 2025-04-28 · conditional · novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.

ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

cs.CR · 2026-05-05 · unverdicted · novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

cs.CR · 2026-05-04 · accept · novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.

Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

cs.CR · 2026-05-01 · unverdicted · novelty 6.0

ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

cs.CR · 2026-04-23 · unverdicted · novelty 6.0

BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

cs.CR · 2026-04-22 · unverdicted · novelty 6.0

SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.

An AI Agent Execution Environment to Safeguard User Data

cs.CR · 2026-04-21 · unverdicted · novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.

How Adversarial Environments Mislead Agentic AI?

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

citing papers explorer

Showing 48 of 48 citing papers.

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening cs.CR · 2026-05-27 · unverdicted · none · ref 13 · internal anchor
Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction cs.CR · 2026-05-28 · unverdicted · none · ref 24 · internal anchor
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs cs.CR · 2026-05-20 · conditional · none · ref 71 · internal anchor
Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes cs.LG · 2026-05-19 · conditional · none · ref 4 · internal anchor
CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexity baselines across six open-weight chat models.
BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts cs.AI · 2026-05-12 · conditional · none · ref 49 · internal anchor
BadSKP poisons graph node embeddings to steer soft prompts in KG-enhanced LLMs, achieving high attack success rates where text-channel backdoors fail due to semantic anchoring.
The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play cs.AI · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2.5 models.
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration cs.CR · 2026-05-08 · conditional · none · ref 22 · internal anchor
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Attention Is Where You Attack cs.CR · 2026-04-30 · unverdicted · none · ref 8 · internal anchor
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF cs.CL · 2026-04-20 · unverdicted · none · ref 53 · internal anchor
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 36 · internal anchor
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing cs.SE · 2026-02-02 · unverdicted · none · ref 25 · internal anchor
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Prompt Injection Attack to Tool Selection in LLM Agents cs.CR · 2025-04-28 · conditional · none · ref 26 · internal anchor
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks cs.CR · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs cs.CR · 2026-05-15 · unverdicted · none · ref 38 · internal anchor
Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection cs.CR · 2026-05-05 · unverdicted · none · ref 129 · internal anchor
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses cs.CR · 2026-05-04 · accept · none · ref 18 · internal anchor
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7%.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment cs.AI · 2026-05-03 · unverdicted · none · ref 23 · internal anchor
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions cs.CR · 2026-05-01 · unverdicted · none · ref 14 · internal anchor
SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems cs.CR · 2026-05-01 · unverdicted · none · ref 26 · internal anchor
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/energy reductions on testbed workloads.
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers cs.CR · 2026-04-23 · unverdicted · none · ref 49 · internal anchor
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs cs.CR · 2026-04-22 · unverdicted · none · ref 6 · internal anchor
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 27 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 39 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification cs.CR · 2026-04-11 · unverdicted · none · ref 19 · internal anchor
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism cs.CL · 2026-04-10 · unverdicted · none · ref 15 · internal anchor
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation cs.AI · 2026-03-18 · unverdicted · none · ref 16 · internal anchor
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments cs.CL · 2025-08-06 · unverdicted · none · ref 17 · internal anchor
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Benchmarking Misuse Mitigation Against Covert Adversaries cs.CR · 2025-06-06 · unverdicted · none · ref 13 · internal anchor
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems cs.CR · 2025-06-03 · unverdicted · none · ref 38 · internal anchor
Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models cs.CL · 2024-10-05 · unverdicted · none · ref 17 · internal anchor
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models cs.CR · 2024-03-28 · accept · none · ref 20 · internal anchor
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.
Whispers in the Machine: Confidentiality in Agentic Systems cs.CR · 2024-02-10 · unverdicted · none · ref 54 · internal anchor
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
Jailbreaking Black Box Large Language Models in Twenty Queries cs.LG · 2023-10-12 · conditional · none · ref 37 · internal anchor
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks cs.LG · 2023-10-05 · accept · none · ref 35 · internal anchor
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Re-Triggering Safeguards within LLMs for Jailbreak Detection cs.CR · 2026-05-11 · unverdicted · none · ref 6 · internal anchor
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
SALLIE: Safeguarding Against Latent Language & Image Exploits cs.CR · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models cs.CR · 2025-10-23 · unverdicted · none · ref 11 · internal anchor
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL · 2025-08-28 · unverdicted · none · ref 51 · internal anchor
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG cs.CR · 2025-06-04 · unverdicted · none · ref 20 · internal anchor
Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction cs.CR · 2025-06-02 · unverdicted · none · ref 52 · internal anchor
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models cs.LG · 2024-10-20 · unverdicted · none · ref 3 · internal anchor
Faster-GCG improves GCG efficiency 8x via regularization, temperature sampling, and duplicate avoidance, reaching 78.1% success rate with 32K evaluations across five aligned LLMs.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 66 · internal anchor
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 55 · internal anchor
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey cs.CR · 2024-07-05 · accept · none · ref 37 · internal anchor
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 68 · internal anchor
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 74 · internal anchor
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits cs.CR · 2026-04-01 · unreviewed · ref 11 · internal anchor
