Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs Against Jailbreak Attacks

Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho · 2024 · arXiv 2405.20099

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

citing papers explorer

Showing 1 of 1 citing paper.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion · cs.CR · 2026-04-11 · unverdicted · none · ref 34
