Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
D-Judge applies semantics-preserving output rewriting, trained via SFT and DPO on paired responses that differ in judge scores, to disrupt multi-turn jailbreak refinement loops and reduce attack success on HarmBench.