
arxiv: 2503.24191 · v3 · pith:XGNOJDYWnew · submitted 2025-03-31 · 💻 cs.CR · cs.AI

When Grammar Guides the Attack: Uncovering Control-Plane Vulnerabilities in LLMs with Structured Output

Pith reviewed 2026-05-22 21:50 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords constrained decoding · LLM jailbreak · structured output · control plane attack · AI security · grammar-guided generation · adversarial attacks on LLMs

The pith

Structured output APIs let attackers force malicious prefixes into LLM generation via grammar constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that grammar-guided decoding in structured output APIs creates a control-plane vulnerability in LLMs. Attackers inject a malicious prefix through schema-enforced logit masking, after which the model completes the harmful intent on its own. This control-to-semantic pipeline differs from traditional jailbreaks that manipulate visible inputs. A reader would care because it reveals that safety alignments focused on data-plane inputs leave models open when using structured generation. The work shows DictAttack reaching 94.3-99.5 percent success on leading models and 75.8 percent against current guardrails.

Core claim

Constrained Decoding Attack (CDA) is a jailbreak class that targets the LLM control plane as a control-to-semantic pipeline: schema-enforced logit masking injects a malicious prefix into the generation trajectory, and the model itself completes the harmful intent. CDA is instantiated as EnumAttack, which hides content in enum fields, and DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. DictAttack reaches 94.3-99.5 percent attack success rate on models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, and sustains 75.8 percent success against SOTA jailbreak guardrails, exposing a semantic gap that requires cross-plane defenses.

What carries the argument

The control-to-semantic pipeline of Constrained Decoding Attack (CDA), in which schema-enforced logit masking forces a malicious prefix that the model then follows to completion.
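
To make the masking step concrete, here is a minimal, benign sketch of how a grammar mask can override a model's preferred next token. It is not the paper's implementation: the token scores, the `mask_logits` and `greedy_pick` helpers, and the one-token "grammar state" are all hypothetical, chosen only to show the mechanism on a harmless example.

```python
import math

def mask_logits(logits, allowed):
    """Copy `logits`, setting every token outside `allowed` to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

def greedy_pick(logits):
    """Return the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)

# Hypothetical next-token scores from a model decoding freely.
logits = {"Sorry": 3.2, "Sure": 1.1, "{": 0.4, "The": 0.9}

# Toy grammar state (assumption): the schema requires the output to open with "{".
allowed = {"{"}

unconstrained = greedy_pick(logits)                        # model's own choice
constrained = greedy_pick(mask_logits(logits, allowed))    # choice after masking

print(unconstrained, constrained)
```

In this toy setup the mask, not the model, decides the emitted token whenever only one token is allowed, which is the sense in which the paper describes the constraint as acting on the control plane.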

If this is right

  • EnumAttack can be stopped by basic grammar auditing while DictAttack cannot.
  • DictAttack maintains 75.8 percent success rate against current state-of-the-art jailbreak guardrails.
  • A semantic gap exists between data-plane and control-plane defenses that must be bridged.
  • Flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b remain highly susceptible.
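
The abstract does not say how "basic grammar auditing" is carried out. One plausible reading, sketched below as an assumption, is to walk a user-supplied JSON Schema, collect every string literal that an `enum` or `const` keyword would force into the output, and hand those literals to whatever content filter the platform already runs. The helper names and the `is_flagged` callback are hypothetical, and the stand-in filter is a trivial substring check.

```python
from typing import Any, Callable, Iterator, List, Tuple

def iter_schema_literals(schema: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (path, literal) for every string under an `enum` or `const` keyword."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            child = f"{path}.{key}"
            if key == "enum" and isinstance(value, list):
                for i, item in enumerate(value):
                    if isinstance(item, str):
                        yield f"{child}[{i}]", item
            elif key == "const" and isinstance(value, str):
                yield child, value
            else:
                # Recurse into nested subschemas (properties, items, anyOf, ...).
                yield from iter_schema_literals(value, child)
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            yield from iter_schema_literals(item, f"{path}[{i}]")

def audit_schema(schema: Any, is_flagged: Callable[[str], bool]) -> List[Tuple[str, str]]:
    """Return the literals that the supplied filter flags."""
    return [(p, s) for p, s in iter_schema_literals(schema) if is_flagged(s)]

# Benign example schema; the flagged literal is a placeholder string.
schema = {
    "type": "object",
    "properties": {
        "color": {"enum": ["red", "green"]},
        "note": {"enum": ["PLACEHOLDER_BLOCKED_TEXT"]},
    },
}

# Stand-in filter (assumption): a real deployment would call a moderation model.
findings = audit_schema(schema, lambda s: "BLOCKED" in s)
print(findings)
```

An audit of this kind only sees literals in isolation. That is consistent with the paper's claim that it mitigates EnumAttack but not DictAttack, where the payload is split between a benign prompt and a dictionary-based grammar.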

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of structured output APIs may need to add runtime checks that inspect the effect of user-provided grammars on logit masking.
  • Safety evaluations should test models under constrained decoding conditions rather than free-text generation alone.
  • The same control-plane issue could appear in any system that lets users supply custom schemas or grammars for generation.
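
As an illustration of the first point, a runtime check could record, for each decoding step, how many tokens the grammar left available, and then extract the spans the grammar forced (exactly one allowed token) so they can be screened like any other input. This is an editorial sketch, not something the paper proposes: the trace format and the `forced_spans` helper are hypothetical, and real grammar engines may not expose per-step allowed-token counts in this form.

```python
from typing import List, Tuple

def forced_spans(steps: List[Tuple[str, int]]) -> List[str]:
    """Join maximal runs of tokens emitted while exactly one token was allowed.

    `steps` is a list of (emitted_token, number_of_allowed_tokens) pairs.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, n_allowed in steps:
        if n_allowed == 1:
            current.append(token)          # grammar left no choice here
        elif current:
            spans.append("".join(current))  # run ended at a free choice
            current = []
    if current:
        spans.append("".join(current))
    return spans

# Hypothetical trace for a benign schema with one free-text "name" field.
trace = [("{", 1), ('"', 1), ("name", 1), ('"', 1), (":", 1), (' "', 1),
         ("Ada", 5000), ('"', 2), ("}", 1)]

spans = forced_spans(trace)
forced_fraction = sum(1 for _, n in trace if n == 1) / len(trace)
print(spans, forced_fraction)
```

The forced spans could then be passed to the same filter applied to prompts, and an unusually high forced fraction could itself be a signal worth logging. Whether such checks would hold up against DictAttack is not something the abstract lets us assess.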

Load-bearing premise

That safety alignment cannot detect or block completion of a harmful intent once a malicious prefix has been forced by the grammar constraints.

What would settle it

An experiment in which a model with intact safety training refuses to complete the harmful intent even after the grammar has forced the malicious prefix into the output trajectory.

Original abstract

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers.

Large Language Models (LLMs) increasingly serve as tooling platforms through structured output APIs, but the grammar-guided decoding that powers this feature opens a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a new jailbreak class that targets the LLM control plane. CDA is best characterized as a control-to-semantic pipeline: (1) schema-enforced logit masking injects a malicious prefix into the generation trajectory, and (2) the model itself completes the harmful intent. Unlike data-plane jailbreaks that rely on bypassing alignment with visible inputs, CDA acts on the decoding process itself, so internal safety alignment alone cannot stop it. We instantiate CDA with EnumAttack, which hides malicious content in enum fields, and the more evasive DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. Across 13 proprietary/open-weight models and five standard benchmarks, DictAttack achieves 94.3--99.5% Attack Success Rate (ASR) on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. While basic grammar auditing mitigates EnumAttack, DictAttack still sustains 75.8% ASR against SOTA jailbreak guardrails, exposing a "semantic gap" that demands cross-plane defenses bridging the data and control planes. Project page and code are available at https://ict-cda.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Constrained Decoding Attack (CDA), a new jailbreak class that targets the LLM control plane through structured output APIs and grammar-guided decoding. It describes a control-to-semantic pipeline instantiated as EnumAttack (hiding malicious content in enum fields) and the more evasive DictAttack (decoupling the payload across a benign prompt and a dictionary-based grammar). Across 13 models, it claims that DictAttack achieves 94.3-99.5% ASR on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b, and that it sustains 75.8% ASR against SOTA guardrails, whereas basic grammar auditing mitigates EnumAttack. It concludes by calling for cross-plane defenses.

Significance. If the empirical results and attack constructions hold under scrutiny, the work would be significant for identifying an attack surface orthogonal to data-plane jailbreaks, as it shows that schema-enforced logit masking can bypass internal safety alignment by acting directly on the decoding trajectory.

major comments (1)
  1. [Abstract] Abstract: The central claims of 94.3--99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review of our manuscript. We address the single major comment below.

  1. Referee: [Abstract] Abstract: The central claims of 94.3--99.5% ASR on flagship models and 75.8% ASR against guardrails are stated without any description of attack grammar definitions, the control-to-semantic pipeline implementation, benchmark prompts, success criteria, ablation on grammar auditing, experimental methodology, or verification steps, rendering the reported rates impossible to assess or reproduce.

    Authors: The abstract is a concise summary by design and does not contain the full methodological details. The complete descriptions of attack grammar definitions, the control-to-semantic pipeline, benchmark prompts, success criteria, ablation studies on grammar auditing, experimental methodology, and verification steps appear in the main body of the manuscript (Sections 3–6). This structure follows standard academic practice, where abstracts highlight contributions and results while the body supplies the information needed for assessment and reproduction. We are prepared to add a brief sentence to the abstract if the editor requests it.

    revision: no

Circularity Check

0 steps flagged

No circularity: empirical ASR claims rest on unreproduced experiments, not on any derivation that reduces to inputs


The provided abstract contains no equations, no derivation chain, no fitted parameters, and no self-citations. The central claims are purely empirical reports of attack success rates obtained by running DictAttack and EnumAttack against models; these are presented as experimental outcomes rather than results derived from any prior result or definition within the paper. Because no load-bearing step equates a prediction to its own input by construction, the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the existence of a control-to-semantic pipeline in grammar-guided decoding and the empirical ASR measurements; no free parameters are mentioned.

axioms (1)
  • domain assumption: Grammar-guided decoding in structured output APIs permits logit masking that can inject malicious prefixes without triggering safety alignment.
    Invoked in the description of the CDA pipeline.
invented entities (3)
  • Constrained Decoding Attack (CDA): no independent evidence
    purpose: New jailbreak class targeting the control plane
    Defined and instantiated in the abstract.
  • EnumAttack: no independent evidence
    purpose: CDA instantiation hiding content in enum fields
    Introduced in the abstract.
  • DictAttack: no independent evidence
    purpose: More evasive CDA instantiation decoupling payload across prompt and dictionary grammar
    Introduced in the abstract.

pith-pipeline@v0.9.0 · 5837 in / 1455 out tokens · 110898 ms · 2026-05-22T21:50:27.561885+00:00 · methodology
