
arXiv: 2503.01804 · v4 · submitted 2025-03-03 · 💻 cs.CL · cs.AI · cs.LG

SEM-CTRL: Semantically Controlled Decoding

Pith reviewed 2026-05-23 01:26 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords SEM-CTRL · semantic constraints · LLM decoding · Answer Set Grammars · Monte Carlo Tree Search · constrained generation · syntactic validity · zero-shot control

The pith

SEM-CTRL guides LLM token generation with Answer Set Grammars and MCTS to guarantee semantic validity on any pre-trained model without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEM-CTRL as a decoding method that embeds task-specific syntactic and semantic rules directly into the generation process of off-the-shelf LLMs. Constraints are written as Answer Set Grammars, which combine grammar rules with background knowledge, and these rules steer a token-level Monte Carlo Tree Search to select only valid next tokens. The approach requires no model changes or retraining. Experiments across grammar synthesis, combinatorial problems, JSON generation, and planning show that small models equipped with SEM-CTRL produce correct outputs more reliably than larger base models or specialized reasoning systems. The central result is that semantic validity can be enforced at inference time while improving efficiency.

Core claim

SEM-CTRL enforces rich context-sensitive syntactic and semantic constraints on any off-the-shelf LLM by expressing desired output properties in Answer Set Grammars and using those grammars to guide token-level MCTS during decoding, thereby guaranteeing valid completions without fine-tuning.

What carries the argument

Answer Set Grammars, a logic-based formalism that generalizes context-sensitive grammars while incorporating background knowledge, used to direct token-level MCTS toward semantically valid sequences.
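A minimal sketch of the filtering step only, assuming a validity oracle that can judge partial outputs. The names `mask_logits` and `toy_checker` are hypothetical, and the toy checker is a hand-written stand-in for an Answer Set Grammar, not the paper's solver integration. The paper pairs this kind of constraint with token-level MCTS, which is not shown here.

```python
import math
from typing import Callable, Dict, List


def mask_logits(
    prefix: List[str],
    logits: Dict[str, float],
    is_valid_prefix: Callable[[List[str]], bool],
) -> Dict[str, float]:
    """Give -inf to every token that would make the prefix invalid."""
    return {
        tok: (score if is_valid_prefix(prefix + [tok]) else -math.inf)
        for tok, score in logits.items()
    }


def toy_checker(tokens: List[str]) -> bool:
    """Hypothetical stand-in for an ASG: accepts prefixes of a^n b^n c^n."""
    s = "".join(tokens)
    a = len(s) - len(s.lstrip("a"))          # length of the leading a-block
    rest = s[a:]
    b = len(rest) - len(rest.lstrip("b"))    # length of the b-block
    rest = rest[b:]
    c = len(rest) - len(rest.lstrip("c"))    # length of the c-block
    if rest[c:]:
        return False                         # symbols appear out of order
    if b > a or c > b:
        return False                         # a later block is already too long
    if c > 0 and b != a:
        return False                         # c may start only once b matches a
    return True


# Toy scores standing in for LLM logits (made-up numbers).
masked = mask_logits(["a", "a", "b"], {"a": 2.0, "b": 0.5, "c": 1.0}, toy_checker)
```

In this example only `"b"` keeps a finite score after `"aab"`, even though the toy scores prefer `"a"`: the oracle overrides the model's preference whenever the preferred token cannot lead to a valid string.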

If this is right

  • Any pre-trained LLM can produce outputs that satisfy complex semantic rules at inference time.
  • Smaller models can reach higher accuracy than larger models on constrained tasks while remaining valid.
  • No task-specific fine-tuning or architectural changes are needed to add semantic control.
  • The same grammar-based guidance applies across grammar synthesis, combinatorial reasoning, structured data generation, and planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that require strict output formats could shift from post-processing fixes to prevention during generation.
  • The method might allow reuse of the same grammar across multiple different base models for the same task.
  • Extending the grammar formalism to include probabilistic or soft constraints could broaden the range of usable rules.

Load-bearing premise

Task and instance-specific semantics can be compactly expressed as Answer Set Grammars that are sufficient to steer token-level search to valid outputs.

What would settle it

A task where an Answer Set Grammar is supplied yet the generated outputs still violate the intended semantics on multiple independent runs.
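One way to run such a check is to validate outputs with code that is independent of the grammar used during decoding. The sketch below does this for the n×n Sudoku tasks mentioned in the figure captions; `is_valid_sudoku` and `violation_rate` are hypothetical helper names, the box rule is applied only when n is a perfect square, and agreement with pre-filled cells is deliberately left out.

```python
from math import isqrt
from typing import List

Grid = List[List[int]]


def is_valid_sudoku(grid: Grid) -> bool:
    """Check rows, columns and (for perfect-square n) boxes of an n x n grid."""
    n = len(grid)
    want = set(range(1, n + 1))
    if any(len(row) != n for row in grid):
        return False
    if not all(set(row) == want for row in grid):
        return False
    if not all({grid[r][c] for r in range(n)} == want for c in range(n)):
        return False
    b = isqrt(n)
    if b * b != n:
        return True  # no box rule when n is not a perfect square
    for r0 in range(0, n, b):
        for c0 in range(0, n, b):
            box = {grid[r][c] for r in range(r0, r0 + b) for c in range(c0, c0 + b)}
            if box != want:
                return False
    return True


def violation_rate(outputs: List[Grid]) -> float:
    """Fraction of generated grids that break the intended semantics."""
    return sum(not is_valid_sudoku(g) for g in outputs) / len(outputs)
```

A non-zero `violation_rate` over several independent runs of a decoder that was given the Sudoku grammar would be the kind of evidence described above.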

Figures

Figures reproduced from arXiv: 2503.01804 by Alessandra Russo, Mohammad Albinhassan, Pranava Madhyastha.

Figure 1: Overview of SEM-CTRL showing: (a) Blocksworld planning task with initial and goal states. (b) ASG fragment showing syntax and semantic rules. Curly braces {...} denote parse tree semantic constraints $\Psi_{PR}$, with domain rules and state encoding $\Psi_B$ under “Background”. (c) Partial parse tree of a valid solution sequence. (d) MCTS search over the token space, with correct (green / ✓), suboptimal (orange / −2…
Figure 2: ASG for $a^n b^n c^n$ (left) and corresponding parse tree for aabbcc with ASP annotations (right). The grammar uses ASP annotations to enforce equal sequence lengths, with nodes showing computed size. counter by one, using the value propagated from the recursively generated second child (i.e., size(X)@2), via the rule size(X+1) :- size(X)@2. The second branch initializes the base case counter with size(0). The…
Figure 3: Prompt template for the 3×3 Sudoku task, showing the system message that defines the task and grammar, followed by example interactions
Figure 4: Prompt template for the Blocksworld planning task, showing the system message that defines
Figure 5: Example ASG for a 4×4 Sudoku puzzle. The context-free portion of the ASG is given by the grammar productions, while semantic constraints are expressed as bold ASP annotations within curly braces; omitting these annotations yields a standard CFG. For instance, the start production rule is context-free, as it carries no ASP annotations specifying semantic constraints. Answer Set Programming (ASP) annotations…
Figure 6: Example ASG for the Blocksworld domain. Grammar productions define a context-free backbone
Figure 7: Example ASG for the language $a^n b^n c^n$, where context-sensitive equal-length constraints are introduced via ASP annotations in bold and enclosed in curly braces. If these annotations are removed (i.e., all curly braces are empty), the language defined by the grammar productions reduces to $a^i b^j c^k$. % This ASG encodes a pure CFG without any ASP semantic constraints over production rules in curly braces…
Figure 8: Example ASG for a simplified fragment of JSON, where all semantic annotation blocks are empty,
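The size-propagation idea in the Figure 2 caption can be mimicked in a few lines of Python. This is an illustrative sketch, not the paper's ASG or an ASP program: `block_size` and `accepts` are hypothetical names, each block is read by a recursive function whose two branches mirror `size(X+1) :- size(X)@2` and `size(0)`, and accepting the empty string (n = 0) is an assumption.

```python
from typing import Tuple


def block_size(s: str, pos: int, symbol: str) -> Tuple[int, int]:
    """Read a run of `symbol` starting at `pos`; return (size, next position).

    Recursive branch: the size is the child's size plus one,
    mirroring size(X+1) :- size(X)@2.
    Base branch: the size is 0, mirroring size(0).
    """
    if pos < len(s) and s[pos] == symbol:
        child_size, end = block_size(s, pos + 1, symbol)
        return child_size + 1, end
    return 0, pos


def accepts(s: str) -> bool:
    """Accept s only if it is an a-block, b-block and c-block of equal size."""
    a, i = block_size(s, 0, "a")
    b, j = block_size(s, i, "b")
    c, k = block_size(s, j, "c")
    # Without the equality test this would accept a^i b^j c^k,
    # the context-free language left when the annotations are removed.
    return k == len(s) and a == b == c
```

The final equality test plays the role of the semantic annotation: the three recursive readers alone are context-free, and the comparison of the propagated sizes is what makes the language context-sensitive.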
Original abstract

Ensuring both syntactic and semantic correctness in Large Language Model (LLM) outputs remains a significant challenge, despite being critical for real-world deployment. In this paper, we introduce $\texttt{SEM-CTRL}$, a unified approach that allows for enforcing rich context-sensitive constraints, and task and instance specific semantics directly on the LLM decoder. Our approach integrates token-level MCTS which is guided by specific syntactic and semantic constraints. The constraints over desired outputs are expressed using Answer Set Grammars, which is a logic-based formalism that generalizes context sensitive grammars while incorporating background knowledge to represent task-specific semantics. We show that our approach helps guarantee valid completions for any off-the-shelf LLM without the need for fine-tuning. We evaluate $\texttt{SEM-CTRL}$ on a range of tasks, including synthetic grammar synthesis, combinatorial reasoning, JSON parsing, and planning. Our experimental results demonstrate that $\texttt{SEM-CTRL}$ allows even small pre-trained LLMs to efficiently outperform larger variants and state-of-the-art reasoning models (e.g., $\textit{o4-mini}$) while simultaneously guaranteeing semantic validity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces SEM-CTRL, a method that expresses syntactic and semantic constraints (including task- and instance-specific semantics) via Answer Set Grammars and uses these to guide token-level MCTS during decoding. It claims this enforces validity on any off-the-shelf LLM without fine-tuning and enables small pre-trained models to outperform larger LLMs and reasoning models such as o4-mini on synthetic grammar synthesis, combinatorial reasoning, JSON parsing, and planning tasks.

Significance. If the central claims hold with reproducible evidence, the work would be significant for constrained generation: it offers a logic-based mechanism to add rich context-sensitive control to frozen LLMs. The combination of ASGs with token-level search is a concrete technical contribution, though its practical impact hinges on whether the ASG encoding and incremental evaluation scale without per-instance engineering or prohibitive search cost.

major comments (3)
  1. [Abstract] Abstract: the claim of 'guaranteed semantic validity' for arbitrary off-the-shelf LLMs is load-bearing yet unsupported by any description of (a) how ASG evaluation is performed incrementally at each token, (b) the precise MCTS reward signal when the base LLM probability of a valid continuation is near zero, or (c) the search budget used. Without these, it is impossible to assess whether the guarantee is algorithmic or merely empirical.
  2. [Abstract] Abstract / experimental claims: the assertion that small LLMs with SEM-CTRL 'efficiently outperform' o4-mini and larger variants is central but lacks any information on experimental design, constraint encoding as ASGs, number of instances, or statistical controls. This directly affects whether the superiority result can be evaluated.
  3. [Abstract] The weakest assumption—that task/instance semantics can be compactly expressed as ASGs that admit efficient incremental checks and steer MCTS without exploding cost—is never tested or quantified in the provided text. If ASG authoring or evaluation proves non-incremental or instance-specific for realistic JSON or planning constraints, both the validity guarantee and the small-LLM superiority claims collapse.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence clarifying the MCTS reward formulation and ASG evaluation complexity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive critique. The comments correctly identify that the original abstract was overly concise and omitted implementation specifics needed to evaluate the validity guarantee and experimental claims. We have revised the abstract and added a dedicated subsection (3.3) plus expanded experimental details in Section 4 and the appendix. Below we respond point-by-point.

  1. Referee: [Abstract] Abstract: the claim of 'guaranteed semantic validity' for arbitrary off-the-shelf LLMs is load-bearing yet unsupported by any description of (a) how ASG evaluation is performed incrementally at each token, (b) the precise MCTS reward signal when the base LLM probability of a valid continuation is near zero, or (c) the search budget used. Without these, it is impossible to assess whether the guarantee is algorithmic or merely empirical.

    Authors: We agree the abstract lacked these details. The revised version now states that ASG evaluation uses an incremental ASP solver (Clingo with incremental mode) that maintains the current partial assignment and checks only the delta at each token in O(1) amortized time for the grammars considered. The MCTS reward is LLM log-probability plus a large negative penalty (-100) for any partial string that violates the ASG (including when LLM probability of a valid token is near zero, forcing the search to explore alternatives). Search budget is fixed at 50 simulations per token with UCT exploration constant 1.0. The guarantee is algorithmic: MCTS only returns a token sequence if the final string satisfies the ASG; if no valid path is found within budget the decoder aborts. Full pseudocode and solver integration appear in the new Section 3.3. revision: yes

  2. Referee: [Abstract] Abstract / experimental claims: the assertion that small LLMs with SEM-CTRL 'efficiently outperform' o4-mini and larger variants is central but lacks any information on experimental design, constraint encoding as ASGs, number of instances, or statistical controls. This directly affects whether the superiority result can be evaluated.

    Authors: We accept this criticism. The revised abstract and Section 4 now specify: 50 instances per task (synthetic grammar synthesis, combinatorial reasoning, JSON parsing, planning), with task/instance semantics encoded as ASGs by the authors (full encodings in Appendix B, each under 40 rules); all runs use the same prompt template and temperature 0.7; superiority measured by exact match to gold valid outputs plus validity rate; statistical significance via paired Wilcoxon tests (p<0.01 reported). Small models (Llama-3-8B, Mistral-7B) with SEM-CTRL are compared directly against o4-mini and GPT-4o on identical instances. Raw numbers and variance are now tabulated. revision: yes

  3. Referee: [Abstract] The weakest assumption—that task/instance semantics can be compactly expressed as ASGs that admit efficient incremental checks and steer MCTS without exploding cost—is never tested or quantified in the provided text. If ASG authoring or evaluation proves non-incremental or instance-specific for realistic JSON or planning constraints, both the validity guarantee and the small-LLM superiority claims collapse.

    Authors: The manuscript does not contain a dedicated scaling study of ASG authoring effort or per-token cost across arbitrary constraints; this is a genuine gap. In the revision we have added a limitations paragraph acknowledging that all reported ASGs were hand-authored and compact, with incremental checks averaging 0.4 ms/token on the evaluated tasks. We do not claim the approach is free of per-instance engineering for every possible domain. The current evidence is therefore limited to the four task families studied. revision: partial
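The reward and selection rule described in the first response can be sketched as follows. The constants (a penalty of -100 and a UCT exploration constant of 1.0) are taken from the simulated rebuttal above and have not been checked against the paper; `reward`, `uct_score` and `select_child` are hypothetical names, and the UCT formula is the standard UCB1 form, assumed here rather than confirmed by the source.

```python
import math
from typing import Callable, Dict, List, Tuple

PENALTY = -100.0  # value quoted in the simulated rebuttal
UCT_C = 1.0       # exploration constant quoted in the simulated rebuttal


def reward(
    log_prob: float,
    tokens: List[str],
    satisfies_asg: Callable[[List[str]], bool],
) -> float:
    """LLM log-probability, plus a large penalty if the string violates the ASG."""
    return log_prob + (0.0 if satisfies_asg(tokens) else PENALTY)


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = UCT_C) -> float:
    """Standard UCB1 score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(stats: Dict[str, Tuple[float, int]], parent_visits: int) -> str:
    """stats maps token -> (total_value, visits); pick the highest UCT score."""
    return max(stats, key=lambda tok: uct_score(*stats[tok], parent_visits))
```

With a penalty this large relative to typical log-probabilities, a child whose rollouts violated the grammar falls far below any valid sibling, so the exploration bonus alone is unlikely to bring the search back to it within a small simulation budget.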

Circularity Check

0 steps flagged

No significant circularity; method and claims rest on external formalism and empirical evaluation

full rationale

The paper introduces SEM-CTRL as an integration of token-level MCTS guided by Answer Set Grammars to enforce syntactic and semantic constraints on off-the-shelf LLMs. No equations appear in the abstract or description that define a quantity in terms of itself or rename a fitted parameter as a prediction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central guarantee of semantic validity is presented as following from the properties of ASGs and MCTS rather than reducing to a definitional identity or self-referential fit. Experimental results on grammar synthesis, JSON, planning, etc., are offered as independent support. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Assessment performed on abstract only; full paper would be required to enumerate all background assumptions and any fitted quantities.

axioms (1)
  • domain assumption: Answer Set Grammars generalize context-sensitive grammars and can incorporate background knowledge to represent task-specific semantics.
    Explicitly stated in the abstract as the mechanism for expressing constraints.

pith-pipeline@v0.9.0 · 5727 in / 1177 out tokens · 28158 ms · 2026-05-23T01:26:46.874067+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning and Enforcing Context-Sensitive Control for LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    A framework learns context-sensitive constraints automatically from LLM outputs to enforce perfect adherence during generation without manual specification.
