We used the original prompts as the test set to evaluate LLM refusal behaviors

· 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

citing papers explorer

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · none · ref 21
