arxiv: 2605.27110 · v1 · pith:AS6U42SGnew · submitted 2026-05-26 · 💻 cs.CR · cs.CL

BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning

classification 💻 cs.CR cs.CL
keywords: bait, disclosure, model, boundary, escalation, first, jailbreak, reasoning

In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By building each step on the model's previous responses, BAIT turns the model's own reasoning and tendency toward consistency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly outperforming conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge requests; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps can by themselves elicit harmful content while triggering little filtering.