When Grammar Guides the Attack: Uncovering Control-Plane Vulnerabilities in LLMs with Structured Output

Chunwei Xia; Hanyuan Dong; Huimin Cui; Jiacheng Zhao; Ruiyuan Xu; Shuaijiang Li; Shuoming Zhang; Xiaobing Feng; Yangyu Zhang; Yuan Wen

arxiv: 2503.24191 · v3 · pith:XGNOJDYWnew · submitted 2025-03-31 · 💻 cs.CR · cs.AI


Shuoming Zhang, Jiacheng Zhao, Hanyuan Dong, Ruiyuan Xu, Zhicheng Li, Yangyu Zhang, Shuaijiang Li, Yuan Wen, Chunwei Xia, Zheng Wang, Xiaobing Feng, Huimin Cui


Pith reviewed 2026-05-22 21:50 UTC · model grok-4.3


keywords: constrained decoding · LLM jailbreak · structured output · control plane attack · AI security · grammar-guided generation · adversarial attacks on LLMs


The pith

Structured output APIs let attackers force malicious prefixes into LLM generation via grammar constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that grammar-guided decoding in structured output APIs creates a control-plane vulnerability in LLMs. Attackers inject a malicious prefix through schema-enforced logit masking, after which the model completes the harmful intent on its own. This control-to-semantic pipeline differs from traditional jailbreaks that manipulate visible inputs. A reader would care because it reveals that safety alignments focused on data-plane inputs leave models open when using structured generation. The work shows DictAttack reaching 94.3-99.5 percent success on leading models and 75.8 percent against current guardrails.

Core claim

Constrained Decoding Attack (CDA) is a jailbreak class that targets the LLM control plane as a control-to-semantic pipeline: schema-enforced logit masking injects a malicious prefix into the generation trajectory, and the model itself completes the harmful intent. CDA is instantiated as EnumAttack, which hides content in enum fields, and DictAttack, which decouples the payload across a benign prompt and dictionary-based grammar. DictAttack reaches 94.3-99.5 percent attack success rate on models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, and sustains 75.8 percent success against SOTA jailbreak guardrails, exposing a semantic gap that requires cross-plane defenses.

What carries the argument

The control-to-semantic pipeline of Constrained Decoding Attack (CDA), in which schema-enforced logit masking forces a malicious prefix that the model then follows to completion.
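For readers unfamiliar with the mechanism, logit masking in general can be illustrated with a toy example. This is a minimal sketch of constrained decoding as commonly described, not the paper's implementation: the vocabulary, the scores, and the helper names `mask_logits` and `greedy_pick` are all made up for illustration.

```python
import math

def mask_logits(logits, allowed_ids):
    """Return a copy of logits with every token outside allowed_ids set to -inf."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and raw model scores (invented for this example).
vocab = ["yes", "no", "maybe", "{", "}"]
logits = [2.0, 3.5, 0.1, -1.0, -2.0]

# Without constraints the model's own preference wins.
unconstrained = vocab[greedy_pick(logits)]

# A grammar that only permits "yes" or "maybe" at this step overrides it.
constrained = vocab[greedy_pick(mask_logits(logits, {0, 2}))]

print(unconstrained, constrained)  # no yes
```

The point of the sketch is that the mask is applied after the model has scored the tokens, so whoever controls the set of permitted tokens can override the model's preferred choice at that step.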

If this is right

EnumAttack can be stopped by basic grammar auditing while DictAttack cannot.
DictAttack maintains 75.8 percent success rate against current state-of-the-art jailbreak guardrails.
A semantic gap exists between data-plane and control-plane defenses that must be bridged.
Flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b remain highly susceptible.
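The abstract does not say what "basic grammar auditing" consists of. One plausible reading is a pass over the user-supplied schema that extracts literal strings and hands them to a content filter. The sketch below assumes a JSON-Schema-style dictionary; `collect_enum_strings`, `audit_schema`, and the `is_flagged` predicate are hypothetical names, and the predicate stands in for whatever moderation check a deployment actually uses.

```python
def collect_enum_strings(schema):
    """Recursively gather string values found under "enum" or "const" keys."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "enum" and isinstance(value, list):
                found.extend(v for v in value if isinstance(v, str))
            elif key == "const" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_enum_strings(value))
    elif isinstance(schema, list):
        for item in schema:
            found.extend(collect_enum_strings(item))
    return found

def audit_schema(schema, is_flagged):
    """Return the schema literals that the (hypothetical) filter rejects."""
    return [s for s in collect_enum_strings(schema) if is_flagged(s)]

# Benign demo schema; "BLOCKED_PLACEHOLDER" stands in for disallowed text.
schema = {
    "type": "object",
    "properties": {
        "color": {"enum": ["red", "green"]},
        "note": {"const": "BLOCKED_PLACEHOLDER"},
    },
}
flagged = audit_schema(schema, lambda s: "BLOCKED" in s)
print(flagged)  # ['BLOCKED_PLACEHOLDER']
```

An audit of this kind only sees literals that appear whole inside the schema, which is consistent with the abstract's statement that auditing mitigates EnumAttack but not DictAttack.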

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of structured output APIs may need to add runtime checks that inspect the effect of user-provided grammars on logit masking.
Safety evaluations should test models under constrained decoding conditions rather than free-text generation alone.
The same control-plane issue could appear in any system that lets users supply custom schemas or grammars for generation.
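The first extension above, a runtime check on the effect of a grammar on masking, could take many forms. As one hedged illustration, the sketch below measures how long the decoder is left with exactly one permitted token, that is, how much of the output is dictated by the grammar rather than chosen by the model. The function name `longest_forced_run` and the per-step token sets are assumptions for this example; legitimate schemas also force tokens such as braces and keys, so any threshold would be a judgment call.

```python
def longest_forced_run(allowed_sets):
    """Length of the longest streak of decoding steps with exactly one permitted token."""
    best = current = 0
    for allowed in allowed_sets:
        if len(allowed) == 1:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

# Toy trace: the set of tokens the grammar permits at each decoding step.
steps = [{"{"}, {'"a"', '"b"'}, {":"}, {"1"}, {"}"}, {"x", "y"}]
print(longest_forced_run(steps))  # 3
```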

Load-bearing premise

That safety alignment cannot detect or block completion of a harmful intent once a malicious prefix has been forced by the grammar constraints.

What would settle it

An experiment in which a model with intact safety training refuses to complete the harmful intent even after the grammar has forced the malicious prefix into the output trajectory.
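Such an experiment reduces to comparing success rates between conditions. The sketch below shows only the arithmetic, assuming each trial has already been labeled by some judge; the abstract does not state the paper's success criteria, and the outcome lists here are invented placeholders, not results.

```python
def attack_success_rate(outcomes):
    """Fraction of trials judged successful; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("no trials to score")
    return sum(outcomes) / len(outcomes)

# Placeholder judge labels for two conditions (not real data).
free_text = [False, False, True, False]
constrained = [True, True, False, True]

print(attack_success_rate(free_text))    # 0.25
print(attack_success_rate(constrained))  # 0.75
```

If the premise were false, the constrained condition would show a rate close to the free-text one even with the prefix forced.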

Original abstract

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. Large Language Models (LLMs) increasingly serve as tooling platforms through structured output APIs, but the grammar-guided decoding that powers this feature opens a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a new jailbreak class that targets the LLM control plane. CDA is best characterized as a control-to-semantic pipeline: (1) schema-enforced logit masking injects a malicious prefix into the generation trajectory, and (2) the model itself completes the harmful intent. Unlike data-plane jailbreaks that rely on bypassing alignment with visible inputs, CDA acts on the decoding process itself, so internal safety alignment alone cannot stop it. We instantiate CDA with EnumAttack, which hides malicious content in enum fields, and the more evasive DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. Across 13 proprietary/open-weight models and five standard benchmarks, DictAttack achieves 94.3--99.5% Attack Success Rate (ASR) on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. While basic grammar auditing mitigates EnumAttack, DictAttack still sustains 75.8% ASR against SOTA jailbreak guardrails, exposing a "semantic gap" that demands cross-plane defenses bridging the data and control planes. Project page and code are available at https://ict-cda.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The abstract flags a plausible new control-plane attack on structured LLM outputs but supplies zero methodology or verification, so the high ASR numbers cannot be assessed yet.


The one thing to know is that this abstract describes Constrained Decoding Attack as a distinct way to target the grammar-guided decoding step itself rather than the input prompt. If the full paper shows that schema-enforced logit masking can reliably steer generation into harmful completions even when safety alignment is intact, that would be a real addition to the jailbreak literature. The authors split the idea into EnumAttack and the more evasive DictAttack, and they claim DictAttack still succeeds 75.8 percent of the time against current guardrails, while basic grammar auditing mitigates only EnumAttack. That framing of a control-to-semantic pipeline is the clearest novelty on offer.

The paper does a decent job naming the gap between data-plane defenses and the decoding process that structured-output APIs expose, and it points out that production tooling and agent frameworks could be affected if the attack generalizes. Credit is due for making the distinction explicit and for releasing a project page and code link, even if those are not yet usable from the abstract alone.

The soft spots are straightforward and fairly large. No attack grammar definitions, no prompt examples, no success criteria, no ablation on how the malicious prefix is injected, and no comparison against standard baselines appear in the abstract. The 94.3 to 99.5 percent ASR figures on gpt-5, gemini-2.5-pro and the rest are therefore impossible to evaluate; they could reflect a genuine vulnerability or they could reflect evaluation choices that have not been described. The stress-test note is accurate on this point.

The work is aimed at researchers and engineers who build or secure structured-output APIs and guardrails. A reader looking for early signals of new attack surfaces might skim it for the high-level idea, but anyone needing reproducible evidence will find the current version thin.

It deserves a serious referee once the full text and experimental details are supplied, because the underlying question about control-plane exposure is worth checking even if the current numbers turn out to be overstated.

Referee Report

1 major / 0 minor

Summary. The paper introduces Constrained Decoding Attack (CDA) as a new jailbreak class targeting the control plane of LLMs through structured output APIs and grammar-guided decoding. It describes a control-to-semantic pipeline instantiated as EnumAttack (hiding malicious content in enum fields) and the more evasive DictAttack (decoupling payload across benign prompt and dictionary grammar), claiming DictAttack achieves 94.3--99.5% ASR on 13 models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, plus 75.8% ASR against SOTA guardrails even after basic grammar auditing mitigates EnumAttack, and calls for cross-plane defenses.

Significance. If the empirical results and attack constructions hold under scrutiny, the work would be significant for identifying an attack surface orthogonal to data-plane jailbreaks, as it shows that schema-enforced logit masking can bypass internal safety alignment by acting directly on the decoding trajectory.

major comments (1)

[Abstract] The central claims of 94.3--99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review of our manuscript. We address the single major comment below.


Referee: [Abstract] The central claims of 94.3--99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.

Authors: The abstract is a concise summary by design and does not contain the full methodological details. The complete descriptions of attack grammar definitions, the control-to-semantic pipeline, benchmark prompts, success criteria, ablation studies on grammar auditing, experimental methodology, and verification steps appear in the main body of the manuscript (Sections 3–6). This structure follows standard academic practice, where abstracts highlight contributions and results while the body supplies the information needed for assessment and reproduction. We are prepared to add a brief sentence to the abstract if the editor requests it.

revision: no

Circularity Check

0 steps flagged

No circularity: the empirical ASR claims rest on unreproduced experiments, not on any derivation that reduces to its inputs.


The provided abstract contains no equations, no derivation chain, no fitted parameters, and no self-citations. The central claims are purely empirical reports of attack success rates obtained by running DictAttack and EnumAttack against models; these are presented as experimental outcomes rather than results derived from any prior result or definition within the paper. Because no load-bearing step equates a prediction to its own input by construction, the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the existence of a control-to-semantic pipeline in grammar-guided decoding and the empirical ASR measurements; no free parameters are mentioned.

axioms (1)

domain assumption: Grammar-guided decoding in structured output APIs permits logit masking that can inject malicious prefixes without triggering safety alignment.
Invoked in the description of the CDA pipeline.

invented entities (3)

Constrained Decoding Attack (CDA) · no independent evidence
purpose: New jailbreak class targeting the control plane
Defined and instantiated in the abstract.

EnumAttack · no independent evidence
purpose: CDA instantiation hiding content in enum fields
Introduced in the abstract.

DictAttack · no independent evidence
purpose: More evasive CDA instantiation decoupling payload across prompt and dictionary grammar
Introduced in the abstract.

pith-pipeline@v0.9.0 · 5837 in / 1455 out tokens · 110898 ms · 2026-05-22T21:50:27.561885+00:00 · methodology
