THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

· 2026 · cs.CL · arXiv 2606.01738

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks -- including tree-search-based and multi-agent collaborative methods -- across two target models show that THRD reduces ASR to 0.2--4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn~2 or later to detect, validating the necessity of explicit temporal aggregation.

representative citing papers

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

cs.CR · 2026-07-01 · unverdicted · novelty 5.0

Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.

citing papers explorer

Showing 1 of 1 citing paper.

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety cs.CR · 2026-07-01 · unverdicted · none · ref 11 · internal anchor
Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

fields

years

verdicts

representative citing papers

citing papers explorer