Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Aimin Zhang; Chen Lv; Fangzheng Li

arxiv: 2606.25605 · v1 · pith:L7BU6EDQnew · submitted 2026-06-24 · 💻 cs.CL

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

Fangzheng Li , Aimin Zhang , Chen Lv This is my paper

Pith reviewed 2026-06-25 20:56 UTC · model grok-4.3

classification 💻 cs.CL

keywords tool callingstructured outputJSON schemaopen-weight LLMstool suppressionconstraint interactionagent systemsgrammar constraints

0 comments

The pith

Joint tool calling and JSON schema constraints in open-weight LLMs cause models to stop invoking tools while still producing schema-compliant output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that enabling both tool calling and structured JSON schema output together leads open-weight models to suppress tool calls, even as they continue to meet the schema requirements at high rates. This Tool Suppression is traced to the way JSON schemas are compiled into grammar-based token masks that render the special tokens for tool calls unreachable during generation. The authors introduce the Constraint Priority Inversion hypothesis to explain the behavior as schema satisfaction overriding action selection under multiple constraints. They demonstrate that a Transparent Two-Pass Execution approach, which separates tool execution from schema-constrained generation, restores tool use without model changes. The finding indicates that separate testing of each capability can miss reliability problems that appear only in combined deployment.

Core claim

When Tool Calling and JSON Schema constraints are simultaneously enabled, multiple open-weight models cease invoking tools despite maintaining high schema compliance. JSON Schema constraints are compiled into grammar-based token masks, causing tool-call tokens to become unreachable during decoding. This provides an implementation-level explanation for the observed behavior. The Constraint Priority Inversion hypothesis suggests that schema satisfaction may dominate action-selection behavior under multiple simultaneous constraints. Transparent Two-Pass Execution decouples tool execution from schema-constrained response generation and restores tool invocation while preserving structured output

What carries the argument

Grammar-based token masks compiled from JSON Schema constraints, which narrow the allowed token set during decoding and make tool-call tokens unreachable.

If this is right

Tool suppression appears consistently across multiple model families when both constraints are applied together.
Schema compliance stays high even while tool invocation drops to zero under joint constraints.
Evaluating tool use and structured output in isolation misses the interaction failure.
Transparent Two-Pass Execution restores tool calls while keeping structured output without retraining.
Production agent systems that combine both features can encounter reliability issues not visible in separate tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent builders may need to test every combination of constraints rather than assuming independent behavior.
The same masking mechanism could affect other structured output formats beyond JSON schema.
Inference-time decoupling strategies might apply to additional constraint types that compete for token space.
Model providers could expose the active token mask during generation to help diagnose similar interactions.

Load-bearing premise

The observed tool suppression is caused by the grammar-based token masking mechanism rather than by prompt construction, training data, or sampling parameters.

What would settle it

A controlled decoding trace that shows tool-call tokens remaining reachable in the grammar mask under an active JSON schema, or an open-weight model that maintains both tool calling and schema compliance without suppression under identical joint conditions.

Figures

Figures reproduced from arXiv: 2606.25605 by Aimin Zhang, Chen Lv, Fangzheng Li.

**Figure 2.** Figure 2: Transparent Two-Pass Execution Architecture [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗

read the original abstract

Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed in a production Agent system: when Tool Calling and JSON Schema constraints are simultaneously enabled, multiple open-weight models cease invoking tools despite maintaining high schema compliance. We refer to this behavior as Tool Suppression. Through controlled experiments across multiple model families and deployment settings, we consistently reproduce Tool Suppression under joint constraints, while tool execution and schema compliance remain functional when evaluated independently. Further analysis reveals that JSON Schema constraints are compiled into grammar-based token masks, causing tool-call tokens to become unreachable during decoding. This provides an implementation-level explanation for the observed behavior. To interpret the phenomenon, we formulate the Constraint Priority Inversion (CPI) hypothesis, which suggests that schema satisfaction may dominate action-selection behavior under multiple simultaneous constraints. We present CPI as a behavioral hypothesis consistent with the observed evidence rather than a verified internal mechanism. To mitigate the problem, we propose Transparent Two-Pass Execution, an inference-time strategy that decouples tool execution from schema-constrained response generation. Experimental results show that this approach restores tool invocation while preserving structured output guarantees without requiring model retraining. These findings suggest that evaluating tool use and structured output separately may overlook important reliability issues in production Agent systems. Code, data, and docs will be released at https://github.com/Fzsama/Constrain-Tax-26-06.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tool suppression under joint tool-calling and schema constraints is a real production issue with a workable two-pass fix, but the grammar-mask causal account is not directly evidenced.

read the letter

Here's the short version: the paper documents tool suppression when tool calling and structured outputs are forced together in open models, and the two-pass fix looks usable, but the token masking account needs tighter evidence.

The new part is the joint-constraint experiment showing the suppression happens reliably while separate use does not. They cover multiple models and settings, which gives the observation some weight. The mitigation strategy is straightforward and doesn't require changes to the model itself.

Credit to the authors for framing the CPI idea as just a hypothesis consistent with the data rather than claiming they've found the internal cause. That keeps the claims proportionate.

The soft spot is the mechanism. The abstract points to grammar-based token masks excluding tool tokens, but the description does not include direct checks of the mask at the relevant steps. Without that or controls for prompt and sampling, other explanations remain possible. The stress-test note captures this accurately.

This is for engineers and researchers working on agent systems that combine these features. A reader building production agents would get value from the mitigation and the warning about separate testing. It is not a foundational result but a useful empirical note.

The paper deserves peer review because the phenomenon is actionable and the experiments are described as reproducible. Referees can ask for the missing mask details and full numbers.

Referee Report

2 major / 1 minor

Summary. The paper claims that simultaneously enabling tool calling and JSON Schema structured output constraints in open-weight LLMs produces 'Tool Suppression': models cease tool invocations while retaining high schema compliance. The authors attribute this to grammar-based token masks (derived from the schema) rendering tool-call tokens unreachable at decode time. They introduce the Constraint Priority Inversion (CPI) hypothesis as an interpretive frame and propose Transparent Two-Pass Execution as an inference-time mitigation that restores tool use without retraining. The work is positioned as an empirical observation with code release promised.

Significance. If the reported suppression effect proves robust and the token-masking mechanism is directly verified, the result would be significant for production agent systems that combine tool use with structured outputs. The practical mitigation is a clear engineering contribution, and the stated intent to release code, data, and documentation would enable independent verification and extension.

major comments (2)

[Abstract] Abstract: the claim of consistent reproduction 'across multiple model families and deployment settings' is stated without any quantitative metrics, error bars, sample sizes, or full experimental protocol, leaving the magnitude and reliability of Tool Suppression difficult to evaluate.
[Further analysis (token masking explanation)] Further analysis section (token-masking explanation): the assertion that JSON Schema constraints are compiled into grammar-based token masks that make tool-call tokens unreachable is presented as the implementation-level cause, yet no direct evidence (e.g., reported token masks, allowed sets, or mask inspection at positions where a tool call would begin) is supplied. Without this, the causal link remains unverified and alternative factors are not isolated.

minor comments (1)

[Abstract] Abstract: the CPI hypothesis is introduced as interpretive rather than verified; a brief statement of what would constitute a test of the hypothesis would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to improve clarity, evidence, and quantitative reporting. The work remains an empirical study, and we agree that strengthening the presentation of metrics and mechanistic evidence will enhance the contribution.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of consistent reproduction 'across multiple model families and deployment settings' is stated without any quantitative metrics, error bars, sample sizes, or full experimental protocol, leaving the magnitude and reliability of Tool Suppression difficult to evaluate.

Authors: We agree the abstract would benefit from explicit quantitative indicators. In the revision we will add suppression rates (with sample sizes and standard deviations), the number of model families and deployment backends tested, and a concise statement of the experimental protocol. The full protocol and per-model breakdowns already appear in Section 3 and the appendix; we will also reference them explicitly from the abstract. revision: yes
Referee: [Further analysis (token masking explanation)] Further analysis section (token-masking explanation): the assertion that JSON Schema constraints are compiled into grammar-based token masks that make tool-call tokens unreachable is presented as the implementation-level cause, yet no direct evidence (e.g., reported token masks, allowed sets, or mask inspection at positions where a tool call would begin) is supplied. Without this, the causal link remains unverified and alternative factors are not isolated.

Authors: We accept that the initial submission did not include direct mask inspection. We will add a new subsection (and accompanying figure) that reports the allowed token sets produced by the grammar compiler at the first decoding step of a tool-call sequence, demonstrating that the relevant tool-call tokens are excluded. These examples will be drawn from the same open inference engines used in the experiments, thereby providing the requested direct evidence and helping isolate the masking mechanism from other potential factors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations and hypothesis formulation independent of inputs

full rationale

The paper presents an empirical study of tool suppression under joint constraints, reproducing the phenomenon across models via controlled experiments. The explanation that JSON Schema constraints compile to grammar-based token masks is stated as arising from 'further analysis' of inference mechanics, but no equations, fitted parameters, or derivations are provided that reduce to the observations by construction. The CPI hypothesis is explicitly framed as a behavioral interpretation consistent with evidence rather than a verified mechanism or self-referential definition. The mitigation strategy (Transparent Two-Pass Execution) is proposed and tested independently. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing. The central claims rest on direct experimental reproduction and code-level inference inspection, making the work self-contained against external benchmarks with no reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

No numeric free parameters are fitted to data. The Constraint Priority Inversion hypothesis is introduced as an interpretive device without external falsifiable evidence. The work relies on standard domain assumptions about how JSON schemas are compiled into token masks in LLM inference engines, but these are not treated as novel axioms.

invented entities (1)

Constraint Priority Inversion (CPI) no independent evidence
purpose: Interpretive hypothesis that schema satisfaction dominates action selection under multiple constraints
Presented explicitly as a behavioral hypothesis consistent with the evidence rather than a verified internal mechanism

pith-pipeline@v0.9.1-grok · 5797 in / 1268 out tokens · 47556 ms · 2026-06-25T20:56:38.079771+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 4 linked inside Pith

[1]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. InICLR 2023, 2022

2023
[2]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS 2023, 2023

2023
[3]

Qin and et al

Y . Qin and et al. Tool learning with foundation models.arXiv:2304.08354, 2023

Pith/arXiv arXiv 2023
[4]

X. Lin, J. H. Liew, S. Savarese, and J. Li. W&d: Scaling parallel tool calling for efficient deep research agents. arXiv:2602.07359, 2026

arXiv 2026
[5]

W. Shen, C. Li, H. Chen, M. Yan, X. Quan, H. Chen, and J. Ji. Small llms are weak tool learners: A multi-llm agent.arXiv:2401.07324, 2024

arXiv 2024
[6]

R. Wang, H. Li, X. Han, Y . Zhang, and T. Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents.arXiv:2402.11651, 2024

arXiv 2024
[7]

M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai. ”We Need Structured Output”: Towards User-centered Constraints on Large Language Model Output.arXiv:2404.07362, 2024

arXiv 2024
[8]

H. Deng, T. Jiang, Y . Shi, C. Zhou, and F. Huang. Decoupling task-solving and output formatting in llm generation. arXiv:2510.03595, 2025

arXiv 2025
[9]

J. Ray. The constraint tax: Measuring validity-correctness tradeoffs in structured outputs for small language models.arXiv:2605.26128, 2026

Pith/arXiv arXiv 2026
[10]

Agyemang, J

P. Agyemang, J. Williams, and T. Chen. Output generation capacity in small language models.Preprint, 2026

2026
[11]

Garbacea and Q

C. Garbacea and Q. Mei. Why is constrained neural language generation particularly challenging? arXiv:2206.05395, 2022

arXiv 2022
[12]

Suresh, S

T. Suresh, S. Li, and F. Huang. Dingo: Constrained inference for diffusion llms.arXiv:2505.23061, 2025

arXiv 2025
[13]

Wang and et al

L. Wang and et al. A survey on large language model based autonomous agents.arXiv:2308.11432, 2023

Pith/arXiv arXiv 2023
[14]

Garcia, A

M. Garcia, A. Singh, and P. Li. Recency bias in mixture-of-experts language models. InACL 2025, 2025

2025
[15]

idx": tc.index,

C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions.arXiv:2304.12244, 2023. A Test Case Design and Tool/Schema Definitions This appendix provides detailed documentation of the test case design, tool definitions, schema specifications, and experiment...

Pith/arXiv arXiv 2023

[1] [1]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. InICLR 2023, 2022

2023

[2] [2]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. InNeurIPS 2023, 2023

2023

[3] [3]

Qin and et al

Y . Qin and et al. Tool learning with foundation models.arXiv:2304.08354, 2023

Pith/arXiv arXiv 2023

[4] [4]

X. Lin, J. H. Liew, S. Savarese, and J. Li. W&d: Scaling parallel tool calling for efficient deep research agents. arXiv:2602.07359, 2026

arXiv 2026

[5] [5]

W. Shen, C. Li, H. Chen, M. Yan, X. Quan, H. Chen, and J. Ji. Small llms are weak tool learners: A multi-llm agent.arXiv:2401.07324, 2024

arXiv 2024

[6] [6]

R. Wang, H. Li, X. Han, Y . Zhang, and T. Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents.arXiv:2402.11651, 2024

arXiv 2024

[7] [7]

M. X. Liu, F. Liu, A. J. Fiannaca, T. Koo, L. Dixon, M. Terry, and C. J. Cai. ”We Need Structured Output”: Towards User-centered Constraints on Large Language Model Output.arXiv:2404.07362, 2024

arXiv 2024

[8] [8]

H. Deng, T. Jiang, Y . Shi, C. Zhou, and F. Huang. Decoupling task-solving and output formatting in llm generation. arXiv:2510.03595, 2025

arXiv 2025

[9] [9]

J. Ray. The constraint tax: Measuring validity-correctness tradeoffs in structured outputs for small language models.arXiv:2605.26128, 2026

Pith/arXiv arXiv 2026

[10] [10]

Agyemang, J

P. Agyemang, J. Williams, and T. Chen. Output generation capacity in small language models.Preprint, 2026

2026

[11] [11]

Garbacea and Q

C. Garbacea and Q. Mei. Why is constrained neural language generation particularly challenging? arXiv:2206.05395, 2022

arXiv 2022

[12] [12]

Suresh, S

T. Suresh, S. Li, and F. Huang. Dingo: Constrained inference for diffusion llms.arXiv:2505.23061, 2025

arXiv 2025

[13] [13]

Wang and et al

L. Wang and et al. A survey on large language model based autonomous agents.arXiv:2308.11432, 2023

Pith/arXiv arXiv 2023

[14] [14]

Garcia, A

M. Garcia, A. Singh, and P. Li. Recency bias in mixture-of-experts language models. InACL 2025, 2025

2025

[15] [15]

idx": tc.index,

C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. Wizardlm: Empowering large pre-trained language models to follow complex instructions.arXiv:2304.12244, 2023. A Test Case Design and Tool/Schema Definitions This appendix provides detailed documentation of the test case design, tool definitions, schema specifications, and experiment...

Pith/arXiv arXiv 2023