When Grammar Guides the Attack: Uncovering Control-Plane Vulnerabilities in LLMs with Structured Output
Pith reviewed 2026-05-22 21:50 UTC · model grok-4.3
The pith
Structured output APIs let attackers force malicious prefixes into LLM generation via grammar constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Constrained Decoding Attack (CDA) is a jailbreak class that targets the LLM control plane and works as a control-to-semantic pipeline: schema-enforced logit masking injects a malicious prefix into the generation trajectory, and the model itself then completes the harmful intent. CDA is instantiated as EnumAttack, which hides malicious content in enum fields, and DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. DictAttack reaches 94.3–99.5 percent attack success rate on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, and sustains 75.8 percent success against SOTA jailbreak guardrails, exposing a semantic gap that requires cross-plane defenses.
What carries the argument
The control-to-semantic pipeline of Constrained Decoding Attack (CDA), in which schema-enforced logit masking forces a malicious prefix that the model then follows to completion.
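The masking step can be illustrated with a deliberately benign toy. The sketch below is not the paper's implementation; the vocabulary, scores, and the helper name `masked_argmax` are all assumptions made for illustration. It shows only the general mechanism of constrained decoding: tokens outside the grammar are set to negative infinity, so the grammar, not the model's preference, decides what is emitted.

```python
import math

def masked_argmax(logits, allowed):
    """Pick the highest-scoring token among those the grammar allows.

    `logits` maps token -> score; `allowed` is the set of tokens the
    (hypothetical) grammar permits at this decoding step.
    """
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)

# Toy next-token scores: left unconstrained, the model prefers "no".
logits = {"yes": 0.2, "no": 3.1, "maybe": 1.0}

# With every token allowed, the model's own preference wins.
free_choice = masked_argmax(logits, allowed=set(logits))

# A grammar that admits a single token leaves the model no choice.
forced_choice = masked_argmax(logits, allowed={"yes"})

print(free_choice, forced_choice)
```

Applied step after step, a grammar that admits exactly one continuation at each position fixes an entire prefix of the output before the model regains any freedom.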
If this is right
- EnumAttack can be stopped by basic grammar auditing while DictAttack cannot.
- DictAttack maintains 75.8 percent success rate against current state-of-the-art jailbreak guardrails.
- A semantic gap exists between data-plane and control-plane defenses that must be bridged.
- Flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b remain highly susceptible.
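As a rough illustration of what "basic grammar auditing" might look like, the sketch below walks a JSON-Schema-like dictionary and collects every string literal in an `enum` or `const` position so a text filter can inspect it. The summary does not describe the paper's actual auditing procedure, so the function names, the `is_flagged` callback, and the word-count heuristic are assumptions. A check of this kind inspects each literal in isolation, which is one plausible reason it would fail when the payload is split across a benign prompt and a dictionary-based grammar.

```python
def collect_literals(schema):
    """Recursively yield string literals found in `enum` or `const` positions."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "enum" and isinstance(value, list):
                for item in value:
                    if isinstance(item, str):
                        yield item
            elif key == "const" and isinstance(value, str):
                yield value
            else:
                yield from collect_literals(value)
    elif isinstance(schema, list):
        for item in schema:
            yield from collect_literals(item)

def audit_schema(schema, is_flagged, max_words=8):
    """Return literals that the filter flags or that read like free prose.

    `is_flagged` is a caller-supplied text classifier (hypothetical here);
    `max_words` is an arbitrary threshold chosen for illustration.
    """
    return [lit for lit in collect_literals(schema)
            if is_flagged(lit) or len(lit.split()) > max_words]

long_literal = ("this literal is a long sentence that reads like "
                "free text rather than a label")
schema = {
    "type": "object",
    "properties": {
        "color": {"enum": ["red", "green"]},
        "note": {"const": long_literal},
    },
}

# Placeholder filter: a real deployment would call a moderation model instead.
findings = audit_schema(schema, is_flagged=lambda s: "forbidden" in s.lower())
print(findings)
```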
Where Pith is reading between the lines
- Developers of structured output APIs may need to add runtime checks that inspect the effect of user-provided grammars on logit masking.
- Safety evaluations should test models under constrained decoding conditions rather than free-text generation alone.
- The same control-plane issue could appear in any system that lets users supply custom schemas or grammars for generation.
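One way to picture a runtime check on "the effect of user-provided grammars" is to measure how much of the output a grammar fixes in advance. The sketch below is a toy under strong assumptions: the grammar is represented as a finite list of accepted strings and the comparison is character-level, whereas real structured-output grammars are token-level and usually describe infinite languages. The names `forced_prefix` and `forced_fraction` are hypothetical.

```python
def forced_prefix(allowed_outputs):
    """Longest prefix shared by every string the toy grammar accepts."""
    outputs = list(allowed_outputs)
    if not outputs:
        return ""
    shortest = min(outputs, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in outputs):
            return shortest[:i]
    return shortest

def forced_fraction(allowed_outputs):
    """Share of the longest accepted output that is fixed before any choice."""
    outputs = list(allowed_outputs)
    longest = max((len(s) for s in outputs), default=0)
    return len(forced_prefix(outputs)) / longest if longest else 0.0

# Two accepted outputs: only the JSON scaffolding is forced.
open_grammar = ['{"answer": "yes"}', '{"answer": "no"}']
# One accepted output: the grammar dictates every character.
narrow_grammar = ['{"answer": "yes"}']

print(repr(forced_prefix(open_grammar)))
print(forced_fraction(narrow_grammar))
```

A high forced fraction is not harmful in itself, since fixed JSON scaffolding is normal, so a check like this would only be a signal for deciding which grammars deserve closer content inspection.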
Load-bearing premise
That safety alignment cannot detect or block completion of a harmful intent once a malicious prefix has been forced by the grammar constraints.
What would settle it
An experiment in which a model with intact safety training refuses to complete the harmful intent even after the grammar has forced the malicious prefix into the output trajectory.
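The comparison this experiment implies can be tallied as a refusal rate per decoding condition. The sketch below assumes each trial has already been labeled as refused or not by some judge; the condition names and the trial data are made-up placeholders, not results from the paper.

```python
from collections import defaultdict

def refusal_rates(trials):
    """Map each decoding condition to the fraction of trials that were refused.

    `trials` is an iterable of (condition, refused) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [refusals, total]
    for condition, refused in trials:
        counts[condition][0] += int(refused)
        counts[condition][1] += 1
    return {cond: refusals / total for cond, (refusals, total) in counts.items()}

# Placeholder labels for illustration only.
trials = [
    ("free_text", True), ("free_text", True),
    ("free_text", False), ("free_text", True),
    ("forced_prefix", False), ("forced_prefix", True),
    ("forced_prefix", False), ("forced_prefix", False),
]

rates = refusal_rates(trials)
print(rates)
```

If the refusal rate under the forced-prefix condition stayed close to the free-text rate for a model with intact safety training, the load-bearing premise above would be in doubt.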
Original abstract
Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. Large Language Models (LLMs) increasingly serve as tooling platforms through structured output APIs, but the grammar-guided decoding that powers this feature opens a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a new jailbreak class that targets the LLM control plane. CDA is best characterized as a control-to-semantic pipeline: (1) schema-enforced logit masking injects a malicious prefix into the generation trajectory, and (2) the model itself completes the harmful intent. Unlike data-plane jailbreaks that rely on bypassing alignment with visible inputs, CDA acts on the decoding process itself, so internal safety alignment alone cannot stop it. We instantiate CDA with EnumAttack, which hides malicious content in enum fields, and the more evasive DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. Across 13 proprietary/open-weight models and five standard benchmarks, DictAttack achieves 94.3–99.5% Attack Success Rate (ASR) on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. While basic grammar auditing mitigates EnumAttack, DictAttack still sustains 75.8% ASR against SOTA jailbreak guardrails, exposing a "semantic gap" that demands cross-plane defenses bridging the data and control planes. Project page and code are available at https://ict-cda.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Constrained Decoding Attack (CDA) as a new jailbreak class targeting the control plane of LLMs through structured output APIs and grammar-guided decoding. It describes a control-to-semantic pipeline instantiated as EnumAttack (hiding malicious content in enum fields) and the more evasive DictAttack (decoupling the payload across a benign prompt and a dictionary-based grammar). Across 13 models and five benchmarks, it claims that DictAttack achieves 94.3–99.5% ASR on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, and that DictAttack sustains 75.8% ASR against SOTA jailbreak guardrails, whereas basic grammar auditing mitigates EnumAttack. It calls for cross-plane defenses.
Significance. If the empirical results and attack constructions hold under scrutiny, the work would be significant for identifying an attack surface orthogonal to data-plane jailbreaks, as it shows that schema-enforced logit masking can bypass internal safety alignment by acting directly on the decoding trajectory.
major comments (1)
- [Abstract] The central claims of 94.3–99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.
Simulated Author's Rebuttal
We thank the referee for their review of our manuscript. We address the single major comment below.
- Referee: [Abstract] The central claims of 94.3–99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.
  Authors: The abstract is a concise summary by design and does not contain the full methodological details. The complete descriptions of attack grammar definitions, the control-to-semantic pipeline, benchmark prompts, success criteria, ablation studies on grammar auditing, experimental methodology, and verification steps appear in the main body of the manuscript (Sections 3–6). This structure follows standard academic practice, where abstracts highlight contributions and results while the body supplies the information needed for assessment and reproduction. We are prepared to add a brief sentence to the abstract if the editor requests it.
  Revision: no
Circularity Check
No circularity: empirical ASR claims rest on unreproduced experiments, not on any derivation that reduces to inputs
The provided abstract contains no equations, no derivation chain, no fitted parameters, and no self-citations. The central claims are purely empirical reports of attack success rates obtained by running DictAttack and EnumAttack against models; these are presented as experimental outcomes rather than results derived from any prior result or definition within the paper. Because no load-bearing step equates a prediction to its own input by construction, the circularity score is 0.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Grammar-guided decoding in structured output APIs permits logit masking that can inject malicious prefixes without triggering safety alignment.
invented entities (3)
- Constrained Decoding Attack (CDA): no independent evidence
- EnumAttack: no independent evidence
- DictAttack: no independent evidence