2 Pith papers cite this work.
fields
cs.AI 2
representative citing papers
TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.
citing papers explorer
- Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking
TRACE improves RL-based multi-turn jailbreaking by using leave-one-turn-out masking for successful trajectories and harmfulness-based penalties for failed ones, achieving roughly 25% higher attack success rates than prior RL baselines.
- TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
TROJail improves multi-turn LLM jailbreak success rates by framing attacks as trajectory optimization in RL and adding process rewards that penalize early refusals while steering semantic relevance to the target harm.