arxiv: 2511.21086 · v2 · submitted 2025-11-26 · 💻 cs.CL

Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords orthographic constraints · large language models · constraint satisfaction · word puzzles · human difficulty alignment · model families · parameter scaling

The pith

Model family differences create larger gaps than parameter scaling in satisfying orthographic constraints, with only modest human difficulty alignment and systematic failures on common words with unusual spellings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates 39 configurations from three model families on 58 word puzzles that require satisfying precise character-level orthographic constraints. The cross-family gap in F1 reaches 2.0–2.2× (0.761 vs. 0.343), substantially more than the 83% gain from scaling parameters from 4B to 32B within a family, and a partial-correlation analysis rules out tokenizer design as a confound for the within-family scaling result. All families show modest but consistent correlation with human difficulty ratings collected from 10,000 solvers per puzzle. Models nevertheless miss, at high rates, everyday words that satisfy the constraints but have atypical spellings. The pattern points to reliance on distributional statistics over direct constraint enforcement.

Core claim

The paper establishes that differences across model families produce substantially larger performance gaps in orthographic constraint satisfaction than increases in parameter count within families, accompanied by modest calibration to human difficulty ratings and systematic errors on frequent words with atypical orthography such as data, loll, and acai.

What carries the argument

Orthographic constraint satisfaction tested through 58 word puzzles, measured via F1 scores, partial-correlation analysis separating family effects from scaling and tokenizer confounds, and direct comparison to human solver difficulty ratings.
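
The F1 scores mentioned above follow the standard set-based definitions the paper gives in its Section 4.1: precision and recall of the generated word set against the verified solution set for each puzzle. The following is a minimal sketch of that per-puzzle scoring, not the authors' code. The function name, the lowercase normalization, and the example word lists are assumptions made for illustration.

```python
def precision_recall_f1(generated, solutions):
    """Score one puzzle: generated words vs. the verified solution set.

    Normalizing to lowercase and deduplicating via sets is an assumption;
    the paper's exact preprocessing is not described in this summary.
    """
    g = {w.strip().lower() for w in generated if w.strip()}
    s = {w.strip().lower() for w in solutions if w.strip()}
    hits = len(g & s)
    precision = hits / len(g) if g else 0.0
    recall = hits / len(s) if s else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical word lists, for illustration only (not data from the paper).
model_output = ["wagon", "wing", "along", "wallow"]
solution_set = ["wagon", "wing", "wallow", "lowing", "owing"]
p, r, f1 = precision_recall_f1(model_output, solution_set)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because recall divides by the size of the solution set, a model that returns only a few safe words can keep precision high while recall stays low, which is the pattern the paper reports for the cross-family gap.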

If this is right

  • Architectural or training differences between families affect orthographic rule following more than raw scale.
  • Increased thinking budget improves results only for high-capacity models while mid-sized variants saturate or decline.
  • Over-reliance on distributional plausibility causes consistent misses on constraint-valid but orthographically atypical patterns.
  • Modest correlation with human ratings holds across families but does not eliminate the identified failure modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Efforts to improve constraint satisfaction may need family-specific techniques rather than uniform scaling laws.
  • The evaluation method could transfer to other domains that require strict rule adherence, such as formal language generation or code synthesis.
  • Persistent failures on common words suggest targeted interventions like orthography-specific training data to reduce distributional bias.

Load-bearing premise

The 58 word puzzles together with 10,000 human ratings per puzzle form a representative sample for drawing conclusions about orthographic constraint satisfaction and human alignment across large language models in general.

What would settle it

A new collection of word puzzles or constrained generation tasks where within-family parameter scaling produces performance gaps equal to or larger than the observed cross-family differences of 2.0-2.2x.

Figures

Figures reproduced from arXiv: 2511.21086 by Bryan E. Tuck, Rakesh M. Verma.

Figure 1
Figure 1: Zero-shot prompt structure. The prompt specifies the seven available letters, marks the mandatory center letter, and enumerates all constraints explicitly. Models receive identical specifications without solution counts, isolating intrinsic constraint-handling from memorization or calibration to expected output lengths.
Figure 2
Figure 2: Cross-family performance comparison across thinking budgets. Proprietary models achieve 2.0–2.2× higher F1 than the largest open-source model, with the gap driven primarily by recall (68% vs. 23%) rather than precision. Budget sensitivity varies dramatically across families.
Figure 3
Figure 3: Heterogeneous budget trajectories across the Qwen3 family and proprietary models, revealing four distinct behavioral classes. Optimal thinking budgets vary dramatically across models, with some showing zero or negative returns from additional tokens.
Figure 4
Figure 4: Model-human difficulty calibration using 10,000 solver ratings per puzzle. Left: Performance
Figure 5
Figure 5: Word length effects on model and human performance. Left: Model recall by word length. Right:
Original abstract

Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-family evaluation remains limited. We evaluate 39 configurations spanning three model families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Cross-family differences produce substantially larger performance gaps (2.0-2.2x, F1 = 0.761 vs. 0.343) than parameter scaling within families (83% gain from 4B to 32B scaling), and a partial-correlation analysis rules out tokenizer design as a confound for within-family scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade, showing inconsistent compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (\r{ho} = 0.28-0.42) across all families, yet identify systematic failures on common words with unusual orthography ("data", "loll", "acai": 83-91% human success, 94-98% model miss rate). These failures point to over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates 39 configurations across three LLM families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles that require satisfying character-level orthographic constraints. It reports substantially larger performance gaps between families (F1 scores of 0.761 vs. 0.343) than within-family scaling gains (83% improvement from 4B to 32B parameters), heterogeneous returns to increased thinking budget, modest but consistent calibration with human difficulty ratings (Spearman rho = 0.28-0.42), and systematic model failures on common words with atypical orthography (e.g., 'data', 'loll', 'acai') despite high human success rates (83-91%). A partial-correlation analysis is used to rule out tokenizer design as a confound for scaling effects.

Significance. If the empirical patterns hold, the results indicate that model family and architecture exert a stronger influence on orthographic constraint satisfaction than parameter scaling alone, while also revealing a systematic bias toward distributional plausibility that diverges from human performance on atypical but valid patterns. The large-scale human rating collection (10,000 solvers per puzzle) provides a solid basis for the calibration claims and could inform future work on constrained generation and cognitive alignment in LLMs.

major comments (2)
  1. The selection criteria, diversity metrics (word length, frequency, orthographic rarity distribution), and coverage analysis for the 58 word puzzles are not described. This is load-bearing for the central claims comparing cross-family gaps to within-family scaling and for the reported systematic failure modes, because the observed 2.0-2.2x family differences and specific error patterns on atypical orthography could be idiosyncratic to this puzzle set rather than general properties of the model families.
  2. The partial-correlation analysis that rules out tokenizer design as a confound for the within-family scaling result (83% gain) is mentioned but lacks the specific controlled variables, coefficient values, and statistical significance details needed to evaluate its robustness.
minor comments (2)
  1. The abstract uses the notation '\r{ho}' for the Spearman correlation; this should be corrected to the standard symbol ρ for clarity.
  2. A summary table listing the 39 configurations, their parameter counts, and family membership would improve readability of the experimental setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions made.

  1. Referee: The selection criteria, diversity metrics (word length, frequency, orthographic rarity distribution), and coverage analysis for the 58 word puzzles are not described. This is load-bearing for the central claims comparing cross-family gaps to within-family scaling and for the reported systematic failure modes, because the observed 2.0-2.2x family differences and specific error patterns on atypical orthography could be idiosyncratic to this puzzle set rather than general properties of the model families.

    Authors: We acknowledge this omission and agree that explicit documentation of the puzzle selection process is essential for evaluating the generalizability of our findings. In the revised manuscript, we have added a detailed description in the Methods section covering the selection criteria, including stratification by word length, log-frequency from the Google Ngram corpus, and orthographic rarity based on n-gram probabilities. We also include quantitative diversity metrics and a coverage analysis to characterize the distribution of these properties in the 58-puzzle set. These additions support the robustness of the reported cross-family performance gaps and the identification of systematic failure modes on atypical orthography. revision: yes

  2. Referee: The partial-correlation analysis that rules out tokenizer design as a confound for the within-family scaling result (83% gain) is mentioned but lacks the specific controlled variables, coefficient values, and statistical significance details needed to evaluate its robustness.

    Authors: We appreciate the referee highlighting the need for greater transparency in the statistical analysis. We have revised the manuscript to provide a full account of the partial-correlation analysis, including the specific variables controlled for (such as tokenizer vocabulary size and model family), the computed partial correlation coefficients, and associated p-values. This expanded description confirms that the observed scaling benefits within families remain significant after accounting for tokenizer-related factors. revision: yes
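
The partial-correlation idea discussed in the second exchange can be sketched as follows: correlate scale with performance after removing what a tokenizer-related covariate explains in both. This is a minimal illustration using the textbook first-order partial correlation on Pearson coefficients. The variable roles (x as scale, y as F1, z as a tokenizer covariate) and the toy numbers are assumptions; neither the paper summary nor the rebuttal states which estimator or covariates were actually used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data (not from the paper): x and y are each z plus noise that is
# uncorrelated with z and with each other, so z explains all of their link.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]
y = [2, -1, 6, 3]
print(round(pearson(x, y), 3))          # raw correlation is positive
print(round(partial_corr(x, y, z), 3))  # vanishes once z is controlled for
```

In the toy data the raw correlation disappears after controlling for z. The paper's claim runs the other way: the scaling effect is reported to survive the control, which is why the referee asks for the coefficients and p-values.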

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

The paper's central claims rest on direct empirical measurements: F1 scores from 39 model configurations across 58 word puzzles, cross-family performance gaps, within-family scaling gains, partial-correlation analysis to rule out tokenizer confounds, and Spearman correlations (rho = 0.28-0.42) with 10,000 human difficulty ratings per puzzle. No equations, fitted parameters presented as predictions, self-citations bearing the load of uniqueness or ansatz, or renamings of known results appear in the abstract or described methodology. All reported quantities are falsifiable against the fixed puzzle set and external human data, rendering the derivation self-contained with no reduction to inputs by construction.
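
For readers who want to see what the Spearman figures measure, the sketch below computes a rank correlation between a per-puzzle human difficulty score and a per-puzzle model score. It is a plain-Python illustration with average ranks for ties, not the authors' analysis code. The function names and the example vectors are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-puzzle scores, for illustration only.
human_difficulty = [1, 2, 3, 4, 5]
model_error = [2, 1, 4, 3, 5]
print(round(spearman_rho(human_difficulty, model_error), 3))
```

Because only ranks enter the computation, the statistic is insensitive to the scale of either score, which suits a comparison between human solve rates and model recall.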

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical evaluation study using existing commercial and open models plus crowdsourced human ratings; no new theoretical constructs or fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption The 58 selected word puzzles sufficiently represent the space of orthographic constraint satisfaction challenges.
    Invoked to support generalization from the specific test set to broader LLM behavior.

pith-pipeline@v0.9.0 · 5533 in / 1268 out tokens · 46111 ms · 2026-05-17T05:24:14.166068+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Introduction How do large language models satisfy hard orthographic constraints during text generation? Consider a model asked to generate English words using only the letters {A, G, I, L, N, O, W}, where every word must contain W and have at least four letters. This task requires satisfying discrete character-level rules, a fundamental challenge d...

  2. [2]

    Constraint satisfaction may require specialized architectural features, training data characteristics, or optimization objectives beyond general language modeling

    Cross-family performance characterization (Section 5.1): Architectural differences produce substantially larger performance gaps (2.0–2.2×) than parameter scaling within families (83% gain from eightfold increase), with the gap manifesting primarily through recall rather than precision. Constraint satisfaction may require specialized architectural fea...

  3. [3]

    These patterns are inconsistent with uniform compute scaling

    Heterogeneous budget sensitivity (Section 5.2): Thinking budget effects vary dramatically across models: high-capacity variants show strong returns (+0.102 to +0.136 F1), mid-size models degrade monotonically with increased allocation, and the mixture-of-experts 30B variant requires substantial budget for productive expert composition. These patte...

  4. [4]

    These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, independent of vocabulary knowledge

    Human difficulty alignment with mechanistic failure analysis (Sections 5.3, 5.4): Using 10,000 solver ratings per puzzle, we establish modest but consistent calibration (r=0.24–0.38) across families, yet identify systematic failures on common words with unusual orthography. These failures reveal over-reliance on distributional plausibility that pena...

  5. [5]

    Related Work Orthographic Knowledge in Language Models. Language models encode orthographic structure beyond token frequencies (Itzhak et al., 2022), yet limitations persist: multilingual LLMs over-weight orthographic similarity when processing interlingual homographs (Tanwar et al., 2025), show systematic gaps in grapheme-to-phoneme mapping (Suvarna ...

  6. [6]

    wagon” is valid (uses only available letters, includes W, length ≥ 4) while “along

    Experimental Setup 3.1. Task Definition Models generate English words satisfying explicit orthographic constraints. Each puzzle specifies seven unique letters, one designated as mandatory. Valid outputs must satisfy three constraints: minimum four letters, exclusive use of the seven available letters (repetition permitted), and mandatory inclusion of ...

  7. [7]

    Start with the center letter and build words around it

  8. [8]

    Focus on word patterns: common roots, prefixes, suffixes

  9. [9]

    Skip words with letters NOT in the available set

  10. [10]

    Find as many valid English words as possible

  11. [11]

    Find as many valid English words as possible

    Only include words you are confident exist in standard dictionaries OUTPUT FORMAT After your thinking, provide ONLY a clean list of valid words. - One word per line - No numbers, bullets, or punctuation - No explanations, notes, or commentary - No blank lines between words Start your word list now: Figure 1: Zero-shot prompt structure. The prompt specifi...

  12. [12]

    Standard Metrics We evaluate generation quality using precision, recall, and F1 score

    Evaluation Metrics 4.1. Standard Metrics We evaluate generation quality using precision, recall, and F1 score. For instance $i$, let $G_i$ denote generated words and $S_i$ denote the verified solution set. Precision is $P = |G_i \cap S_i| / |G_i|$, recall is $R = |G_i \cap S_i| / |S_i|$, and F1 score is $F_1 = 2PR / (P + R)$. 4.2. Human Difficulty Alignment Using solver data from 10,000 u...

  13. [13]

    data” (93% human success, 89% model miss rate), “poop

    Results 5.1. Architecture and Scale Effects on Constraint Satisfaction Figure 2 reveals a performance hierarchy across model families. Proprietary models achieve F1 scores 2.0–2.2× higher than the largest open-source configuration we tested, with the cross-family gap substantially exceeding improvements from parameter scaling within the Qwen family. Su...

  14. [14]

    data”, “poop

    Conclusion Systematic evaluation across 28 configurations reveals that constraint satisfaction performance depends on factors beyond parameter scaling. Cross-family differences (2.0–2.2×) substantially exceed within-family scaling gains (83%), manifesting through recall rather than precision. Thinking budget effects prove heterogeneous: the 14B model ...

  15. [15]

    Limitations The experimental paradigm focuses exclusively on English orthography, limiting cross-linguistic generalization, though the distributional plausibility mechanism we identify may transfer to other alphabetic systems. The task emphasizes lexical retrieval and constraint satisfaction without requiring semantic understanding, isolating one co...

  16. [16]

    The experimental instances are drawn from the New York Times Spelling Bee for research and evaluation purposes consistent with fair use principles

    Ethical Considerations This work evaluates models on a publicly available word generation task, raising minimal ethical concerns. The experimental instances are drawn from the New York Times Spelling Bee for research and evaluation purposes consistent with fair use principles. We do not redistribute puzzle content beyond what is necessary for reprodu...

  17. [17]

    References Jacob Andreas. 2022. Language models as agent models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Anthropic. 2025. Claude opus 4 & claude sonnet 4 system card. Technical report, Anthropic. Models: Claude Opus 4, Claude S...

  18. [18]

    In Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025), pages 242–257, Southern Denmark University, Odense, Denmark

    Tokenization and morphology in multilingual language models: A comparative analysis of mT5 and ByT5. In Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025), pages 242–257, Southern Denmark University, Odense, Denmark. Association for Computational Linguistics. Nouha Dziri, Ximing Lu, Melanie Sclar...

  19. [19]

    Models in a spelling bee: Language models implicitly learn to spell. In NAACL. Abdelhak Kelious, Mathieu Constant, and Christophe Coeur. 2024. Complex word identification: A comparative study between ChatGPT and a dedicated model for this task. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources an...

  20. [20]

    Surv., 56(2)

    Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv., 56(2). Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, and Taro Watanabe. 2025. Beyond film subtitles: Is YouTube the best approximation of spoken vocabular...

  21. [21]

    Friedemann Pulvermüller

    Flexible and efficient grammar-constrained decoding. Friedemann Pulvermüller. 2010. Brain-language research: Where is the progress? Biolinguistics. Federico Raspanti, Tanir Ozcelebi, and Mike Holenderski. 2025. Grammar-constrained decoding makes large language models better logical parsers. In Proceedings of the 63rd Annual Meeting of the Association for...
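
The task definition quoted in entry [6] above (seven available letters, one of them mandatory, a minimum of four letters, repetition permitted) can be sketched as a small validity check. This is an illustrative sketch under those stated rules only: the function names are hypothetical, and it checks the orthographic constraints alone, not whether a string is a real dictionary word, which the paper handles by scoring against a verified solution set.

```python
def is_valid_word(word, letters, center, min_length=4):
    """Check the three orthographic constraints from the task definition.

    letters: the seven available letters; center: the mandatory letter.
    Dictionary membership is deliberately NOT checked here.
    """
    w = word.strip().lower()
    allowed = {c.lower() for c in letters}
    return (
        len(w) >= min_length        # minimum four letters
        and center.lower() in w     # mandatory letter is present
        and set(w) <= allowed       # only available letters (repeats allowed)
    )


def filter_valid(words, letters, center):
    """Keep only the candidate words that satisfy all constraints."""
    return [w for w in words if is_valid_word(w, letters, center)]


# Letter set taken from the paper's introductory example; W is mandatory.
candidates = ["wagon", "along", "wig", "wagons", "wallow"]
print(filter_valid(candidates, "AGILNOW", "W"))
```

Under this check "along" is rejected because it lacks the mandatory W, "wig" because it is too short, and "wagons" because S is not among the available letters.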