Capacity, Not Format: Rethinking Structured Reasoning Failures

Hengxin Fan

arxiv: 2606.09410 · v1 · pith:2JBBD34Lnew · submitted 2026-06-08 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-27 16:32 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords structured output · model capacity · JSON formatting · reasoning performance · token budget · schema complexity · delayed structure

The pith

Structured output formats degrade reasoning only in models operating near their capacity limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that requiring models to produce JSON or other structured outputs is not a fixed tax on reasoning. Instead, the performance cost appears only when a model has little spare capacity left after handling the core task. High-capacity models handle the format constraints with no measurable drop, while lower-capacity ones lose accuracy through truncation under tight token budgets or through direct competition for internal resources even when budgets are expanded. The effect grows with schema complexity and is largely reversed by allowing free reasoning before any formatting step. This reframes prior results on structured reasoning failures as capacity issues rather than inherent format problems.

Core claim

Structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation, as seen when Sonnet maintains 88.7% on MATH-Hard under JSON versus 89.3% under CoT. In contrast, models near their limits drop sharply: Haiku loses 36.2 percentage points largely from truncation under standard budgets, while GPT-4o-mini loses 28.0 points even with extended budgets that eliminate truncation. The penalty scales with schema complexity and is not explained by prompt length. A delayed-structure ablation that lets the model reason freely before formatting recovers most of the lost accuracy.

What carries the argument

Information-matched prose controls, combined with a four-level schema complexity gradient, isolate format effects from prompt-length effects across models and benchmarks.

If this is right

Models with headroom can adopt structured output without accuracy loss on the tested benchmarks.
Models near capacity limits suffer either truncation or direct resource competition when forced to format.
The accuracy penalty increases reliably with the number of required fields and nesting depth.
Allowing unrestricted reasoning before applying structure restores most of the lost performance.
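
The summary above ties the penalty to the number of required fields and to nesting depth, but does not give the paper's four schema levels. As a rough illustration only, the sketch below measures those two quantities for a schema written as an example object. The function name `schema_complexity`, the depth convention, and the two sample templates are assumptions made here, not the paper's definitions.

```python
from typing import Any, Dict, Tuple


def schema_complexity(template: Any) -> Dict[str, int]:
    """Count leaf fields and maximum nesting depth of a JSON-like template.

    Assumption: the schema is supplied as an example object built from
    dicts, lists, and scalar placeholders, not as a JSON Schema document.
    """

    def walk(node: Any, depth: int) -> Tuple[int, int]:
        # Returns (leaf field count, deepest level reached below the root).
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = list(node)
        else:
            return 1, depth  # a scalar placeholder is one field
        if not children:
            return 0, depth + 1  # empty container still adds a level
        results = [walk(child, depth + 1) for child in children]
        return sum(r[0] for r in results), max(r[1] for r in results)

    fields, depth = walk(template, 0)
    return {"fields": fields, "depth": depth}


# Hypothetical templates at two ends of a complexity gradient.
flat = {"answer": ""}
nested = {"steps": [{"claim": "", "justification": ""}], "answer": ""}

print(schema_complexity(flat))    # {'fields': 1, 'depth': 1}
print(schema_complexity(nested))  # {'fields': 3, 'depth': 3}
```

Here depth counts the containers enclosing the deepest leaf, so a flat object has depth 1. Any other monotone convention would serve equally well for ordering schemas.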

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompting strategies could dynamically switch between free-form and structured modes based on estimated remaining capacity for a given task.
Training objectives that explicitly reward both reasoning quality and format adherence might reduce the observed capacity competition.
Evaluation suites for new models should report structured versus unstructured accuracy as a function of task difficulty to surface capacity headroom.
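
The reporting suggestion in the last item can be made concrete with a small aggregation. The sketch below groups per-item results by a difficulty label and reports the JSON minus CoT gap in percentage points for each group. The record layout `(difficulty, format_name, is_correct)` and the function name are hypothetical, and the sample data is invented purely to show the shape of the output.

```python
from collections import defaultdict


def penalty_by_difficulty(results):
    """Return per-difficulty accuracy for CoT and JSON plus their gap.

    results: iterable of (difficulty, format_name, is_correct) tuples,
    where format_name is "cot" or "json" (hypothetical record layout).
    """
    # counts[difficulty][format] = [number correct, number attempted]
    counts = defaultdict(lambda: {"cot": [0, 0], "json": [0, 0]})
    for difficulty, fmt, is_correct in results:
        cell = counts[difficulty][fmt]
        cell[0] += int(is_correct)
        cell[1] += 1

    report = {}
    for difficulty in sorted(counts):
        cot_ok, cot_n = counts[difficulty]["cot"]
        json_ok, json_n = counts[difficulty]["json"]
        if cot_n == 0 or json_n == 0:
            continue  # skip levels that lack one of the two conditions
        cot_acc = 100.0 * cot_ok / cot_n
        json_acc = 100.0 * json_ok / json_n
        report[difficulty] = {
            "cot": cot_acc,
            "json": json_acc,
            "penalty_pp": json_acc - cot_acc,  # negative means JSON is worse
        }
    return report


# Invented example data: the gap opens only at the harder level.
sample = [
    (1, "cot", True), (1, "json", True),
    (5, "cot", True), (5, "cot", True),
    (5, "json", True), (5, "json", False),
]
print(penalty_by_difficulty(sample))
```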

Load-bearing premise

The prose controls and schema gradient successfully separate format-specific costs from prompt length and other confounds.

What would settle it

Observing comparable performance drops under JSON even in models with large headroom on information-matched tasks would falsify the capacity-dependence claim.

Figures

Figures reproduced from arXiv: 2606.09410 by Hengxin Fan.

**Figure 2.** Decomposing prompt length vs. format effect on …

**Figure 3.** Format penalty (JSON−CoT accuracy) vs. cross-validated capacity across 19 model×benchmark cells. Shaded regions indicate capacity regimes: models in the comfort zone (>86%) show negligible format effects, while capacity-limited models (<72%) suffer large penalties. Haiku on MATH-Hard is an outlier driven by token-budget truncation. Spearman $r_s = 0.51$, $p = 0.026$.
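
The Figure 3 caption gives two thresholds and a rank correlation. The sketch below shows one way such an analysis could be set up in plain Python. The 86% and 72% cut-offs come from the caption; the label "intermediate" for the band between them, the function names, and the sample numbers are assumptions, and the sample numbers are not the paper's data.

```python
from statistics import mean


def capacity_regime(capacity_pct: float) -> str:
    """Map a capacity estimate (in percent) to the regimes named in Figure 3."""
    if capacity_pct > 86.0:
        return "comfort"
    if capacity_pct < 72.0:
        return "capacity-limited"
    return "intermediate"  # assumed label; the caption does not name this band


def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented cells: (capacity %, JSON minus CoT penalty in pp).
capacity = [91.0, 88.0, 80.0, 70.0, 65.0]
penalty = [-0.5, 0.3, -6.0, -20.0, -12.0]
print([capacity_regime(c) for c in capacity])
print(spearman(capacity, penalty))
```

With the penalty defined as JSON minus CoT, a positive correlation means that higher-capacity cells sit closer to zero penalty, which is the direction the caption reports.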

Original abstract

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
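
The "think first, format later" recommendation can be pictured as a two-call pipeline. The sketch below is a minimal illustration under stated assumptions, not the paper's ablation code: `generate` stands for any text-in, text-out model call supplied by the caller, and the prompt wording and the two output keys are invented for the example.

```python
import json
from typing import Callable, Dict


def delayed_structure(question: str, generate: Callable[[str], str]) -> Dict:
    """Reason in free text first, then ask for JSON in a separate call.

    `generate` is a caller-supplied function (hypothetical here) that sends
    a prompt to a model and returns its text completion.
    """
    # Stage 1: unconstrained reasoning, no format requirements.
    reasoning = generate(
        "Solve the problem step by step.\n\nProblem: " + question
    )
    # Stage 2: formatting only; the model is told not to re-solve.
    formatted = generate(
        "Convert the solution below into JSON with exactly the keys "
        '"reasoning_summary" and "answer". Do not re-solve the problem.\n\n'
        "Solution:\n" + reasoning
    )
    return json.loads(formatted)


# Stub model used only to show the control flow without any API call.
def fake_model(prompt: str) -> str:
    if prompt.startswith("Convert"):
        return '{"reasoning_summary": "2 + 2 = 4", "answer": "4"}'
    return "Adding 2 and 2 gives 4."


print(delayed_structure("What is 2 + 2?", fake_model))
```

Keeping the formatting request out of the first prompt is the point of the design: the reasoning pass never has to track a schema, which is the competition the abstract attributes the penalty to.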

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows format penalties are mostly capacity competition, backed by controls and a useful ablation, though matching details need checking.

The main point is that JSON-style structured output hurts models near their limits through capacity competition or truncation, but models with headroom absorb it without loss. The delayed-structure ablation recovers most accuracy, which points to competition rather than an inherent format tax.

They do solid work separating the effects. Information-matched prose controls plus the four-level schema gradient, run across four models and five benchmarks, let them show the penalty scales with complexity and persists even with extra token budget. The McNemar tests and the small real drop on Opus qualify earlier immunity claims. The ablation is the cleanest piece of evidence for their mechanism.
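
For readers unfamiliar with the McNemar test mentioned above: it compares two conditions on the same items using only the items where the conditions disagree. The sketch below implements the exact (binomial) two-sided version. Whether the paper uses the exact or the chi-square variant is not stated in the abstract, so the choice here is an assumption, and the counts in the usage lines are invented.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: items answered correctly under CoT but incorrectly under JSON
    c: items answered incorrectly under CoT but correctly under JSON
    Items where both conditions agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(b, c)
    # Under the null, each discordant item falls on either side with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(10, 0))  # 0.001953125
print(mcnemar_exact(5, 5))   # 1.0
```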

The soft spot is whether the matching fully removes extra planning load from field mapping and schema adherence. That demand could still interact with capacity in ways not purely about format, and the ablation does not rule it out completely. The abstract leaves the exact verification steps unclear, so the full paper needs to show those controls are tight.

This is for people deploying LLMs with structured outputs who need to decide when to force JSON versus free text. A reader working on capacity-aware prompting or model selection would get practical value. It deserves a serious referee because the empirical separation is new and the results are directly testable.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that structured output formats such as JSON impose a capacity-dependent cost on LLMs rather than an inherent reasoning tax. Using information-matched prose controls and a four-level schema complexity gradient across 4 models and 5 benchmarks, it shows that high-capacity models absorb JSON without degradation (e.g., Sonnet: 88.7% JSON vs. 89.3% CoT on MATH-Hard), while lower-capacity models suffer large drops (Haiku: 36.2pp, largely due to truncation; GPT-4o-mini: 28.0pp even with extended budgets). The penalty scales with schema complexity (McNemar p<0.0001), and a delayed-structure ablation recovers most of the lost accuracy (3-run mean: 80-87%).

Significance. If the results hold, the work reframes structured reasoning failures around capacity competition rather than format, with practical implications for deployment (think first, format later). Strengths include the multi-model/multi-benchmark design, statistical tests, ablation, and explicit handling of parse failures and rounding details, providing falsifiable empirical evidence that qualifies claims of frontier model immunity.

major comments (2)

[Abstract/Experimental Design] The central claim that penalties reflect capacity competition (not format-specific planning) depends on information-matched prose controls successfully isolating effects. The abstract describes the matching and schema gradient but provides no details on verification of equivalent output planning or constraint satisfaction loads, leaving open whether JSON imposes additional demands that interact with capacity (as noted in the stress-test concern).
[Results (GPT-4o-mini)] The 28.0pp drop (p<0.001) with extended budgets is load-bearing evidence for 'pure capacity competition independent of token exhaustion.' Without explicit description of budget extension implementation, confirmation that truncation was eliminated across all cases, and full controls for other variables, this distinction is difficult to assess from the provided summary.

minor comments (2)

[Abstract] The parenthetical on exact difference calculation (7/133 = 5.26pp) for Opus is precise; retain and expand in main text for transparency.
[Abstract] Specify the 5 benchmarks and exact four-level schema complexities to support replication.
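
The rounding point in the first minor comment is easy to verify by hand. The abstract gives the denominator (133) and the exact difference (7/133); the per-condition counts of 128 and 121 correct answers used below are inferred from the rounded percentages and are an assumption, not figures quoted from the paper.

```python
from fractions import Fraction


def rounding_gap(n_items: int, correct_a: int, correct_b: int) -> dict:
    """Compare the difference of rounded percentages with the exact difference."""
    acc_a = Fraction(100 * correct_a, n_items)
    acc_b = Fraction(100 * correct_b, n_items)
    shown_a = round(float(acc_a), 1)
    shown_b = round(float(acc_b), 1)
    return {
        "shown_a": shown_a,
        "shown_b": shown_b,
        "diff_of_rounded": round(shown_a - shown_b, 1),
        "exact_drop": round(float(acc_a - acc_b), 2),
    }


# 128 and 121 are inferred counts (assumption); 133 is from the abstract.
print(rounding_gap(133, 128, 121))
# {'shown_a': 96.2, 'shown_b': 91.0, 'diff_of_rounded': 5.2, 'exact_drop': 5.26}
```

Subtracting the displayed values gives 5.2pp, while the exact difference of 7/133 is 5.26pp, which rounds to the 5.3pp the abstract reports.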

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the clarity of our experimental controls and implementation details. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract/Experimental Design] The central claim that penalties reflect capacity competition (not format-specific planning) depends on information-matched prose controls successfully isolating effects. The abstract describes the matching and schema gradient but provides no details on verification of equivalent output planning or constraint satisfaction loads, leaving open whether JSON imposes additional demands that interact with capacity (as noted in the stress-test concern).

Authors: The full manuscript (Section 3.2 and Appendix A) details the information-matching procedure, including manual verification on a 200-sample subset for equivalent planning demands and automated constraint-satisfaction checks. We agree the abstract is too concise on this point and will revise it to briefly reference the verification steps while adding an explicit paragraph on load-equivalence testing in the methods section. (Revision: yes)

Referee: [Results (GPT-4o-mini)] The 28.0pp drop (p<0.001) with extended budgets is load-bearing evidence for 'pure capacity competition independent of token exhaustion.' Without explicit description of budget extension implementation, confirmation that truncation was eliminated across all cases, and full controls for other variables, this distinction is difficult to assess from the provided summary.

Authors: The paper already reports 0% truncation and parse failures under extended budgets (Table 2), but we concur that the implementation protocol requires more explicit description. In revision we will add a dedicated methods subsection specifying the exact budget-doubling procedure, the two-stage generation process, per-run truncation verification (zero cases observed), and cross-references to the full variable controls in Appendix C. (Revision: yes)
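
Per-run truncation verification of the kind promised here could be as simple as the audit sketched below. It is an illustration under stated assumptions: each response record is taken to carry a `text` field and a `finish_reason` field whose value `"length"` marks a token-budget cut-off. Both field names and that convention are hypothetical and would need to match whatever the serving API actually returns.

```python
import json


def audit_responses(records):
    """Report truncation and parse-failure rates for a batch of responses.

    records: iterable of dicts with hypothetical keys "text" and
    "finish_reason"; "length" is assumed to mean the budget ran out.
    """
    total = truncated = parse_failures = 0
    for record in records:
        total += 1
        if record["finish_reason"] == "length":
            truncated += 1
            continue  # truncated outputs are tallied separately
        try:
            json.loads(record["text"])
        except json.JSONDecodeError:
            parse_failures += 1
    completed = total - truncated
    return {
        "truncation_rate": truncated / total if total else 0.0,
        # Parse failures are measured only over completed generations.
        "parse_failure_rate": parse_failures / completed if completed else 0.0,
    }


batch = [
    {"text": '{"answer": "4"}', "finish_reason": "stop"},
    {"text": '{"answer": "1', "finish_reason": "length"},
]
print(audit_responses(batch))  # {'truncation_rate': 0.5, 'parse_failure_rate': 0.0}
```

Separating the two rates mirrors the abstract's phrasing of parse failures "on successfully generated responses": a cut-off response is a budget problem, not a formatting error.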

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations or self-referential reductions

The paper reports direct experimental results across 4 models and 5 benchmarks using information-matched controls, schema gradients, and a delayed-structure ablation. All claims rest on observed accuracy deltas, McNemar tests, and truncation analysis rather than any equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain exists that could reduce outputs to inputs by construction. The central capacity-competition claim is supported by independent measurements (e.g., 28pp drop under extended budgets) and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation study. No free parameters are fitted to produce the central claim. No new entities are postulated. Axioms are limited to standard assumptions about controlled experimental design in LLM evaluation.

axioms (1)

Domain assumption: Information-matched prose controls and the schema complexity gradient isolate format effects from length confounds. Invoked in the abstract to justify separation of format-specific effects.

pith-pipeline@v0.9.1-grok · 5866 in / 1275 out tokens · 19685 ms · 2026-06-27T16:32:58.943237+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 5 linked inside Pith

[1] Banerjee, D.; Suresh, T.; Ugare, S.; Misailovic, S.; and Singh, G. 2025. CRANE: Reasoning with Constrained LLM Generation. In Proceedings of the International Conference on Machine Learning.

[2] Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[3] Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In NeurIPS Datasets and Benchmarks.

[4] Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

[5] Lee, I. Y.; D'Antoni, L.; and Berg-Kirkpatrick, T. 2026. The Format Tax. arXiv preprint arXiv:2604.03616.

[6] Long, D. X.; Ngoc, H. N.; Sim, T.; Dao, H.; Joty, S.; Kawaguchi, K.; Chen, N. F.; and Kan, M.-Y. 2025. LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics.

[7] Ray, J. 2026. The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models. arXiv preprint arXiv:2605.26128.

[8] Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q.; Chi, E.; Zhou, D.; and Wei, J. 2023. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In Findings of the Association for Computational Linguistics: ACL 2023.

[9] Tam, Z. R.; Wu, C.-K.; Tsai, Y.-L.; Lin, C.-Y.; Lee, H.-y.; and Chen, Y.-N. 2024. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track.

[10] Wang, Y.; Ma, X.; Zhang, G.; Ni, Y.; Chandra, A.; Guo, S.; Ren, W.; Arulraj, A.; He, X.; Jiang, Z.; et al. 2024. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Advances in Neural Information Processing Systems, 37.

[11] Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.

[12] Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; and Zhou, D. 2022b. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35.

[13] Yang, L.; Yu, Z.; Cui, B.; and Wang, M. 2025. ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates. arXiv preprint arXiv:2502.06772.

[14] Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T. L.; Cao, Y.; and Narasimhan, K. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems, volume 36.

[15] Yuan, H.; Zhao, Y.; Zhang, L.; Luo, W.; and Ma, Z. 2026. Quantifying the Impact of Structured Output Format on Large Language Models through Causal Inference. Findings of the European Chapter of the Association for Computational Linguistics.

[16] Zhou, H. 2026. From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection. arXiv preprint arXiv:2604.06066.

[17] Zhou, P.; Pujara, J.; Ren, X.; Chen, X.; Cheng, H.-T.; Le, Q. V.; Chi, E. H.; Zhou, D.; Mishra, S.; and Zheng, H. S. 2024. Self-Discover: Large Language Models Self-Compose Reasoning Structures. In Advances in Neural Information Processing Systems, volume 37.
