SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Ding, Peng, Sun, Wen, Li, Dailin, Zou, Wei, Wang, Jiaming, Chen, Jiajun · 2025 · DOI 10.18653/v1/2025.emnlp-main.253

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

citing papers explorer

Showing 1 of 1 citing paper.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · none · ref 10
