Defending large language models against jailbreaking attacks through goal prioritization

Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, Minlie Huang · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

GuardPhish: Securing Open-Source LLMs from Phishing Abuse

cs.CR · 2026-04-19 · unverdicted · novelty 7.0

Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

citing papers explorer

Showing 2 of 2 citing papers.

GuardPhish: Securing Open-Source LLMs from Phishing Abuse cs.CR · 2026-04-19 · unverdicted · none · ref 5
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment cs.AI · 2026-05-03 · unverdicted · none · ref 31
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

Defending large language models against jailbreaking attacks through goal prioritization

fields

years

verdicts

representative citing papers

citing papers explorer