One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

2026 · cs.CL · arXiv 2605.05630

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID enables a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.

representative citing papers

Investigating and Alleviating Harm Amplification in LLM Interactions

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Presents HarmAmp benchmark for multi-turn harm amplification in LLMs and TrajSafe proactive monitor that reduces harm while keeping low over-refusal and preserving capabilities.

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

cs.CR · 2026-07-01 · unverdicted · novelty 5.0

Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.

citing papers explorer

Investigating and Alleviating Harm Amplification in LLM Interactions cs.CL · 2026-06-01 · unverdicted · none · ref 47 · internal anchor
Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety cs.CR · 2026-07-01 · unverdicted · none · ref 27 · internal anchor