Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Jiatong Li; Jingyu Peng; Kai Zhang; Maolin Wang; Nan Wang; Pengyue Jia; Wanyu Wang; Xiangyu Zhao; Yuchen Li; Yuyang Ye

arxiv: 2505.13527 · v4 · submitted 2025-05-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 15:07 UTC · model grok-4.3

keywords LLM jailbreak · logical expressions · safety alignment · distributional gap · black-box attack · multilingual evaluation

The pith

Converting harmful prompts to formal logical expressions bypasses LLM safety filters by exploiting gaps in alignment training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LogiBreak as a black-box method that rewrites natural-language harmful requests into formal logical expressions. This change creates inputs whose distribution differs from the natural-language data used to train safety alignments, allowing the underlying intent to reach the model while avoiding refusal triggers. The approach keeps the original meaning and readability intact across languages. If correct, it shows that current safety layers rest on a narrow slice of possible input forms rather than a deeper understanding of harm.

Core claim

LogiBreak translates harmful natural language prompts into formal logical expressions to circumvent LLM safety systems. It exploits the distributional gap between alignment data and logic-based inputs while preserving the underlying semantic intent and readability, and it demonstrates effectiveness on a multilingual jailbreak dataset spanning three languages.

What carries the argument

Translation of harmful prompts into formal logical expressions that sit outside the distribution of safety-aligned training examples.

If this is right

The attack succeeds as a universal black-box method without needing model internals.
Effectiveness holds across multiple languages and evaluation settings.
Semantic intent and readability of the original harmful request remain intact after translation.
The same distributional gap can be created by other structured input formats beyond logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety training sets should be expanded to cover formal logical and other structured representations to close this gap.
The vulnerability may extend to any input format that was rare or absent during alignment.
Testing models on diverse syntactic forms could serve as a practical robustness check before deployment.

Load-bearing premise

Safety mechanisms fail mainly because of distributional differences between the prompts used in alignment training and the prompts an attacker can craft.

What would settle it

Training or fine-tuning an LLM's safety layer on a dataset that includes formal logical expressions and then measuring whether LogiBreak success rate drops to near zero.
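One way to make "drops to near zero" checkable is to report the success rate with an uncertainty interval before and after the safety layer is retrained. The sketch below is a minimal illustration, not the paper's evaluation code: the function names, the Wilson score interval, and the judge outcomes are all assumptions introduced here, and the numbers are invented.

```python
import math

def success_rate(outcomes):
    """Fraction of attempts judged successful; outcomes is a list of bools."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical judge outcomes (True = the model complied), invented for illustration.
before = [True] * 40 + [False] * 10  # safety layer trained on natural language only
after = [True] * 2 + [False] * 48    # safety layer also trained on logical forms

for label, outcomes in (("before", before), ("after", after)):
    low, high = wilson_interval(sum(outcomes), len(outcomes))
    print(f"{label}: rate={success_rate(outcomes):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the two intervals do not overlap and the "after" interval sits close to zero, that would count in favor of the distributional explanation; overlapping intervals would leave the question open.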

Original abstract

Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LogiBreak shows a workable format shift to logical expressions for jailbreaking but provides no measurements to confirm the distributional gap is the actual cause.

The paper's main point is that turning harmful prompts into formal logical expressions lets them slip past LLM safety training because those inputs sit outside the distribution of alignment data. They test LogiBreak on a multilingual set spanning three languages and report it keeps the original meaning while evading refusals across settings. That translation step is the concrete new piece here, and the black-box nature plus the cross-language checks are straightforward to follow and potentially reusable for other attack studies. The multilingual results give a bit of breadth that some earlier jailbreak papers lack.

The soft spot is exactly the one in the stress-test note: nothing in the work measures whether the logical versions actually differ in distribution from alignment prompts. No perplexity numbers, no embedding distances, no token-level stats. Without those, the success could just as easily come from the logical structure itself prompting step-by-step reasoning that bypasses refusal patterns. The hypothesis is stated clearly but the evidence does not yet pin it down.

This is the sort of paper that would interest people running empirical studies on LLM vulnerabilities or trying to harden alignment against format tricks. A reader looking for new attack templates could extract the translation method and test it themselves. It is not deep enough on mechanism for someone wanting causal insight into safety failures.

It deserves a serious referee to check the full experimental details and ask for the missing distributional measurements. I would send it for review with a request to add those controls before acceptance.
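The perplexity numbers the letter asks for are cheap to compute once per-token log-probabilities are available. The sketch below is a minimal illustration: the `perplexity` helper and both lists of log-probabilities are hypothetical values made up for this example, and obtaining real scores from an aligned model is assumed to happen elsewhere.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability), using natural-log token scores."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token natural-log probabilities, invented for illustration.
natural_prompt = [-1.2, -0.8, -1.5, -0.9]   # a prompt in ordinary prose
logical_prompt = [-3.1, -2.7, -3.4, -2.9]   # the same request in logical notation

print(f"natural: {perplexity(natural_prompt):.2f}")
print(f"logical: {perplexity(logical_prompt):.2f}")
```

A consistently higher perplexity for logical prompts would be compatible with the distributional-gap hypothesis, though on its own it would not exclude the step-by-step-reasoning alternative raised above.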

Referee Report

2 major / 2 minor

Summary. The paper hypothesizes that LLM safety vulnerabilities arise from distributional discrepancies between alignment-oriented prompts and malicious ones. It introduces LogiBreak, a black-box method that translates harmful natural-language prompts into formal logical expressions to exploit this gap while preserving semantic intent and readability, and evaluates the approach on a multilingual jailbreak dataset spanning three languages, claiming effectiveness across various settings and linguistic contexts.

Significance. If the central claim holds and is supported by rigorous measurements, the work would provide evidence that current safety alignments are brittle to format shifts toward logical expressions, with potential implications for developing more robust, distributionally aware alignment techniques in multilingual settings.

major comments (2)

[Abstract] The manuscript states that LogiBreak 'demonstrat[es] its effectiveness' on a multilingual dataset but supplies no quantitative results, success rates, baselines, error analysis, or methodology details, preventing verification of whether the data supports the distributional-gap hypothesis.
[Abstract] No supporting measurements (perplexity, token probabilities, or embedding-space distances) are reported to establish that logical expressions differ in distribution from alignment data; without these, alternative mechanisms such as logical structure promoting step-by-step reasoning cannot be ruled out.
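The quantitative summary the first comment asks for could be as simple as a success-rate table broken down by language and method. The sketch below shows one possible shape for that aggregation. The `success_by_group` helper, the language labels (`lang_a`, `lang_b`), the method labels, and the verdicts are all placeholders introduced here, since the abstract names neither the languages nor the baselines.

```python
from collections import defaultdict

def success_by_group(records):
    """Return attack success rate for each (language, method) pair.

    records: iterable of (language, method, success) tuples, success being a bool.
    """
    counts = defaultdict(lambda: [0, 0])  # (language, method) -> [successes, attempts]
    for language, method, success in records:
        counts[(language, method)][0] += int(success)
        counts[(language, method)][1] += 1
    return {key: wins / attempts for key, (wins, attempts) in counts.items()}

# Hypothetical judge verdicts; language and method labels are placeholders.
records = [
    ("lang_a", "baseline", False), ("lang_a", "baseline", True),
    ("lang_a", "logical_form", True), ("lang_a", "logical_form", True),
    ("lang_b", "baseline", False), ("lang_b", "logical_form", True),
]

rates = success_by_group(records)
for (language, method), rate in sorted(rates.items()):
    print(f"{language:8s} {method:14s} {rate:.2f}")
```

Reporting the attempt counts alongside each rate would let a reader judge how much weight a given cell can bear.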

minor comments (2)

The method name 'LogiBreak' in the abstract is inconsistent with the title's 'Logic Jailbreak'; uniform terminology would aid readability.
[Abstract] The abstract refers to 'various evaluation settings' without enumerating them or describing the evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, clarifying the content of the full paper and indicating revisions to strengthen the presentation of results and support for the central hypothesis.

Referee: [Abstract] The manuscript states that LogiBreak 'demonstrat[es] its effectiveness' on a multilingual dataset but supplies no quantitative results, success rates, baselines, error analysis, or methodology details, preventing verification of whether the data supports the distributional-gap hypothesis.

Authors: The abstract provides a high-level summary of the evaluation, while the full manuscript reports quantitative results in the Experiments section, including attack success rates across the three languages, comparisons against multiple jailbreak baselines, error analysis, and full methodology details. To improve immediate verifiability from the abstract, we will revise it to include key quantitative highlights such as overall and per-language success rates along with a brief reference to the evaluation protocol.

revision: yes

Referee: [Abstract] No supporting measurements (perplexity, token probabilities, or embedding-space distances) are reported to establish that logical expressions differ in distribution from alignment data; without these, alternative mechanisms such as logical structure promoting step-by-step reasoning cannot be ruled out.

Authors: We agree that explicit distributional measurements would provide stronger direct evidence for the hypothesized gap. The current results show that logical expressions substantially outperform natural-language equivalents, which indirectly supports a distributional explanation, but we will add a new analysis subsection reporting perplexity scores, average token probabilities under the aligned model, and cosine distances in embedding space between logical prompts and typical alignment data. This will help rule out alternative mechanisms such as step-by-step reasoning induced by logical form.

revision: yes
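The embedding-space comparison promised in the second response can be sketched as a cosine distance between the centroids of two prompt sets. This is a minimal illustration under stated assumptions: the `centroid` and `cosine_distance` helpers and the toy 3-dimensional vectors are hypothetical, and a real analysis would take embeddings from an actual encoder.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for illustration.
alignment_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
logical_embeddings = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.4]]

gap = cosine_distance(centroid(alignment_embeddings), centroid(logical_embeddings))
print(f"centroid cosine distance: {gap:.3f}")
```

Centroid distance is a coarse summary; comparing it against the distance between two random halves of the alignment set would give a baseline for how large a gap counts as meaningful.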

Circularity Check

0 steps flagged

No significant circularity; empirical attack method is self-contained

The paper states a hypothesis about distributional discrepancies between alignment data and malicious prompts, then introduces LogiBreak as a black-box translation method evaluated on external multilingual datasets. No equations, fitted parameters, or self-citations are used to derive the central claim; success is presented as an empirical observation rather than a reduction to the hypothesis by construction. The derivation chain does not loop back to inputs via definition or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the approach rests on the stated hypothesis of distributional discrepancies without further decomposition.

pith-pipeline@v0.9.0 · 5670 in / 1016 out tokens · 49549 ms · 2026-05-22T15:07:08.552875+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we posit the following inclusion relationship: $X_{\text{harmful}} \subset D_{\text{aligned}}$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
cs.CR 2026-05 unverdicted novelty 7.0

SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...