Selfdefend: Llms can defend themselves against jailbreaking in a practical manner,

· 2024 · arXiv 2406.05498

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

cs.CR · 2026-04-20 · unverdicted · novelty 7.0 · 2 refs

The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

citing papers explorer

Showing 3 of 3 citing papers.

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring cs.CR · 2026-05-09 · unverdicted · none · ref 26
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection cs.CR · 2026-04-20 · unverdicted · none · ref 23 · 2 links
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 103
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Selfdefend: Llms can defend themselves against jailbreaking in a practical manner,

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer