Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3
The pith
Model family differences create larger gaps than parameter scaling in satisfying orthographic constraints, with only modest human difficulty alignment and systematic failures on common words with unusual spellings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that differences across model families produce substantially larger performance gaps in orthographic constraint satisfaction than increases in parameter count within families, accompanied by modest calibration to human difficulty ratings and systematic errors on frequent words with atypical orthography such as data, loll, and acai.
What carries the argument
Orthographic constraint satisfaction tested through 58 word puzzles, measured via F1 scores, partial-correlation analysis separating family effects from scaling and tokenizer confounds, and direct comparison to human solver difficulty ratings.
If this is right
- Architectural or training differences between families affect orthographic rule following more than raw scale.
- Increased thinking budget improves results only for high-capacity models while mid-sized variants saturate or decline.
- Over-reliance on distributional plausibility causes consistent misses on constraint-valid but orthographically atypical patterns.
- Modest correlation with human ratings holds across families but does not eliminate the identified failure modes.
Where Pith is reading between the lines
- Efforts to improve constraint satisfaction may need family-specific techniques rather than uniform scaling laws.
- The evaluation method could transfer to other domains that require strict rule adherence, such as formal language generation or code synthesis.
- Persistent failures on common words suggest targeted interventions like orthography-specific training data to reduce distributional bias.
Load-bearing premise
The 58 word puzzles together with 10,000 human ratings per puzzle form a representative sample for drawing conclusions about orthographic constraint satisfaction and human alignment across large language models in general.
What would settle it
A new collection of word puzzles or constrained generation tasks where within-family parameter scaling produces performance gaps equal to or larger than the observed cross-family differences of 2.0-2.2x.
original abstract
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-family evaluation remains limited. We evaluate 39 configurations spanning three model families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Cross-family differences produce substantially larger performance gaps (2.0-2.2x, F1 = 0.761 vs. 0.343) than parameter scaling within families (83% gain from 4B to 32B scaling), and a partial-correlation analysis rules out tokenizer design as a confound for within-family scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade, showing inconsistent compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (\r{ho} = 0.28-0.42) across all families, yet identify systematic failures on common words with unusual orthography ("data", "loll", "acai": 83-91% human success, 94-98% model miss rate). These failures point to over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns.
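The headline comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; reading the "83% gain" as a relative F1 improvement is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the abstract. The variable names are illustrative only.
F1_HIGH = 0.761            # higher of the two F1 scores quoted
F1_LOW = 0.343             # lower of the two F1 scores quoted
WITHIN_FAMILY_GAIN = 0.83  # "83% gain from 4B to 32B", read as a relative F1 gain (assumption)

# Express both effects as multiplicative ratios so they are comparable.
cross_family = F1_HIGH / F1_LOW
within_family = 1.0 + WITHIN_FAMILY_GAIN

print(f"cross-family ratio:  {cross_family:.2f}x")
print(f"within-family ratio: {within_family:.2f}x")
```

Under that reading, the two quoted F1 scores correspond to the upper end of the stated 2.0-2.2x range, which exceeds the within-family ratio implied by an 83% gain.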
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates 39 configurations across three LLM families (Qwen3, Claude Haiku 4.5, GPT-5-mini) on 58 word puzzles that require satisfying character-level orthographic constraints. It reports substantially larger performance gaps between families (F1 scores of 0.761 vs. 0.343) than within-family scaling gains (83% improvement from 4B to 32B parameters), heterogeneous returns to increased thinking budget, modest but consistent calibration with human difficulty ratings (Spearman rho = 0.28-0.42), and systematic model failures on common words with atypical orthography (e.g., 'data', 'loll', 'acai') despite high human success rates (83-91%). A partial-correlation analysis is used to rule out tokenizer design as a confound for scaling effects.
Significance. If the empirical patterns hold, the results indicate that model family and architecture exert a stronger influence on orthographic constraint satisfaction than parameter scaling alone, while also revealing a systematic bias toward distributional plausibility that diverges from human performance on atypical but valid patterns. The large-scale human rating collection (10,000 solvers per puzzle) provides a solid basis for the calibration claims and could inform future work on constrained generation and cognitive alignment in LLMs.
major comments (2)
- The selection criteria, diversity metrics (word length, frequency, orthographic rarity distribution), and coverage analysis for the 58 word puzzles are not described. This is load-bearing for the central claims comparing cross-family gaps to within-family scaling and for the reported systematic failure modes, because the observed 2.0-2.2x family differences and specific error patterns on atypical orthography could be idiosyncratic to this puzzle set rather than general properties of the model families.
- The partial-correlation analysis that rules out tokenizer design as a confound for the within-family scaling result (83% gain) is mentioned but lacks the specific controlled variables, coefficient values, and statistical significance details needed to evaluate its robustness.
minor comments (2)
- The abstract uses the notation 'r{ho}' for Spearman correlation; this should be corrected to the standard rho symbol for clarity.
- A summary table listing the 39 configurations, their parameter counts, and family membership would improve readability of the experimental setup.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions made.
point-by-point responses
- Referee: The selection criteria, diversity metrics (word length, frequency, orthographic rarity distribution), and coverage analysis for the 58 word puzzles are not described. This is load-bearing for the central claims comparing cross-family gaps to within-family scaling and for the reported systematic failure modes, because the observed 2.0-2.2x family differences and specific error patterns on atypical orthography could be idiosyncratic to this puzzle set rather than general properties of the model families.
  Authors: We acknowledge this omission and agree that explicit documentation of the puzzle selection process is essential for evaluating the generalizability of our findings. In the revised manuscript, we have added a detailed description in the Methods section covering the selection criteria, including stratification by word length, log-frequency from the Google Ngram corpus, and orthographic rarity based on n-gram probabilities. We also include quantitative diversity metrics and a coverage analysis to characterize the distribution of these properties in the 58-puzzle set. These additions support the robustness of the reported cross-family performance gaps and the identification of systematic failure modes on atypical orthography.
  revision: yes
- Referee: The partial-correlation analysis that rules out tokenizer design as a confound for the within-family scaling result (83% gain) is mentioned but lacks the specific controlled variables, coefficient values, and statistical significance details needed to evaluate its robustness.
  Authors: We appreciate the referee highlighting the need for greater transparency in the statistical analysis. We have revised the manuscript to provide a full account of the partial-correlation analysis, including the specific variables controlled for (such as tokenizer vocabulary size and model family), the computed partial correlation coefficients, and associated p-values. This expanded description confirms that the observed scaling benefits within families remain significant after accounting for tokenizer-related factors.
  revision: yes
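For readers unfamiliar with the technique discussed in this exchange, a first-order partial correlation can be sketched as follows. This is a generic textbook formulation, not the authors' analysis: the paper's actual variables, coefficients, and p-values are not given here, and the vectors `x`, `y`, `z` are synthetic toy data chosen so the result is easy to verify by hand.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic example: x and y are each z plus noise, and the two noise
# terms are orthogonal to z and to each other. These are NOT paper data.
z = [-1.0, -1.0, 1.0, 1.0]
x = [-2.0, 0.0, 0.0, 2.0]
y = [0.0, -2.0, 0.0, 2.0]

raw = pearson(x, y)               # correlation before controlling for z
adjusted = partial_corr(x, y, z)  # correlation after controlling for z
print(f"raw={raw:.3f} adjusted={adjusted:.3f}")
```

In this toy case the raw correlation is fully explained by the control variable, so the adjusted value collapses to zero. The authors' claim is the opposite pattern: the scaling effect is said to remain significant after tokenizer-related factors are controlled for.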
Circularity Check
No significant circularity in empirical evaluation
full rationale
The paper's central claims rest on direct empirical measurements: F1 scores from 39 model configurations across 58 word puzzles, cross-family performance gaps, within-family scaling gains, partial-correlation analysis to rule out tokenizer confounds, and Spearman correlations (rho = 0.28-0.42) with 10,000 human difficulty ratings per puzzle. No equations, fitted parameters presented as predictions, self-citations bearing the load of uniqueness or ansatz, or renamings of known results appear in the abstract or described methodology. All reported quantities are falsifiable against the fixed puzzle set and external human data, rendering the derivation self-contained with no reduction to inputs by construction.
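The calibration figure cited above is a rank correlation. A minimal sketch of Spearman's rho with average ranks for ties is shown below; the function names are illustrative, and which per-puzzle quantities the authors actually correlate is not specified here, so the inputs in the usage line are hypothetical placeholders rather than paper data.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-puzzle scores, for illustration only.
human_difficulty = [0.9, 0.7, 0.4, 0.2]
model_score = [0.8, 0.6, 0.5, 0.1]
print(spearman(human_difficulty, model_score))
```

Because only the ordering matters, rho is insensitive to monotone rescaling of either variable, which suits a comparison between human ratings and model scores measured on different scales.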
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 58 selected word puzzles sufficiently represent the space of orthographic constraint satisfaction challenges.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. ... modest but consistent calibration (r=0.24–0.38)
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Cross-family differences produce substantially larger performance gaps (2.0–2.2×, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. How do large language models satisfy hard orthographic constraints during text generation? Consider a model asked to generate English words using only the letters {A, G, I, L, N, O, W}, where every word must contain W and have at least four letters. This task requires satisfying discrete character-level rules, a fundamental challenge d...
- [2] Cross-family performance characterization (Section 5.1): Architectural differences produce substantially larger performance gaps (2.0–2.2×) than parameter scaling within families (83% gain from eightfold increase), with the gap manifesting primarily through recall rather than precision. Constraint satisfaction may require specialized architectural fea...
- [3] These patterns are inconsistent with uniform compute scaling
  Heterogeneous budget sensitivity (Section 5.2): Thinking budget effects vary dramatically across models: high-capacity variants show strong returns (+0.102 to +0.136 F1), mid-size models degrade monotonically with increased allocation, and the mixture-of-experts 30B variant requires substantial budget for productive expert composition. These patte...
- [4] Human difficulty alignment with mechanistic failure analysis (Sections 5.3, 5.4): Using 10,000 solver ratings per puzzle, we establish modest but consistent calibration (r=0.24–0.38) across families, yet identify systematic failures on common words with unusual orthography. These failures reveal over-reliance on distributional plausibility that pena...
- [5] Related Work. Orthographic Knowledge in Language Models. Language models encode orthographic structure beyond token frequencies (Itzhak et al., 2022), yet limitations persist: multilingual LLMs over-weight orthographic similarity when processing interlingual homographs (Tanwar et al., 2025), show systematic gaps in grapheme-to-phoneme mapping (Suvarna ...
- [6] “wagon” is valid (uses only available letters, includes W, length ≥ 4) while “along
  Experimental Setup. 3.1. Task Definition. Models generate English words satisfying explicit orthographic constraints. Each puzzle specifies seven unique letters, one designated as mandatory. Valid outputs must satisfy three constraints: minimum four letters, exclusive use of the seven available letters (repetition permitted), and mandatory inclusion of ...
- [7] Start with the center letter and build words around it
- [8] Focus on word patterns: common roots, prefixes, suffixes
- [9] Skip words with letters NOT in the available set
- [10] Find as many valid English words as possible
- [11] Find as many valid English words as possible
  Only include words you are confident exist in standard dictionaries. OUTPUT FORMAT: After your thinking, provide ONLY a clean list of valid words. One word per line. No numbers, bullets, or punctuation. No explanations, notes, or commentary. No blank lines between words. Start your word list now: Figure 1: Zero-shot prompt structure. The prompt specifi...
- [12] Standard Metrics. We evaluate generation quality using precision, recall, and F1 score
  Evaluation Metrics. 4.1. Standard Metrics. We evaluate generation quality using precision, recall, and F1 score. For instance $i$, let $G_i$ denote generated words and $S_i$ denote the verified solution set. Precision is $P = |G_i \cap S_i| / |G_i|$, recall is $R = |G_i \cap S_i| / |S_i|$, and F1 score is $F_1 = 2PR / (P + R)$. 4.2. Human Difficulty Alignment. Using solver data from 10,000 u...
- [13] “data” (93% human success, 89% model miss rate), “poop
  Results. 5.1. Architecture and Scale Effects on Constraint Satisfaction. Figure 2 reveals a performance hierarchy across model families. Proprietary models achieve F1 scores 2.0–2.2× higher than the largest open-source configuration we tested, with the cross-family gap substantially exceeding improvements from parameter scaling within the Qwen family. Su...
- [14] Conclusion. Systematic evaluation across 28 configurations reveals that constraint satisfaction performance depends on factors beyond parameter scaling. Cross-family differences (2.0–2.2×) substantially exceed within-family scaling gains (83%), manifesting through recall rather than precision. Thinking budget effects prove heterogeneous: the 14B model ...
- [15] Limitations. The experimental paradigm focuses exclusively on English orthography, limiting cross-linguistic generalization, though the distributional plausibility mechanism we identify may transfer to other alphabetic systems. The task emphasizes lexical retrieval and constraint satisfaction without requiring semantic understanding, isolating one co...
- [16] Ethical Considerations. This work evaluates models on a publicly available word generation task, raising minimal ethical concerns. The experimental instances are drawn from the New York Times Spelling Bee for research and evaluation purposes consistent with fair use principles. We do not redistribute puzzle content beyond what is necessary for reprodu...
- [17] References. Jacob Andreas. 2022. Language models as agent models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Anthropic. 2025. Claude opus 4 & claude sonnet 4 system card. Technical report, Anthropic. Models: Claude Opus 4, Claude S...
- [18] Tokenization and morphology in multilingual language models: A comparative analysis of mT5 and ByT5. In Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025), pages 242–257, Southern Denmark University, Odense, Denmark. Association for Computational Linguistics. Nouha Dziri, Ximing Lu, Melanie Sclar...
- [19] Models in a spelling bee: Language models implicitly learn to spell. In NAACL. Abdelhak Kelious, Mathieu Constant, and Christophe Coeur. 2024. Complex word identification: A comparative study between ChatGPT and a dedicated model for this task. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources an...
- [20] Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv., 56(2). Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, and Taro Watanabe. 2025. Beyond film subtitles: Is YouTube the best approximation of spoken vocabular...
- [21] Flexible and efficient grammar-constrained decoding. Friedemann Pulvermüller. 2010. Brain-language research: Where is the progress? Biolinguistics. Federico Raspanti, Tanir Ozcelebi, and Mike Holenderski. 2025. Grammar-constrained decoding makes large language models better logical parsers. In Proceedings of the 63rd Annual Meeting of the Association for...
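The task definition quoted in entry [6] and the metric definitions quoted in entry [12] can be sketched together as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the validity check covers only the three orthographic constraints (dictionary membership is left to the verified solution set), and the solution set in the usage example is a toy list rather than an actual puzzle answer key.

```python
def is_valid(word, letters, required, min_len=4):
    """Check the three orthographic constraints from the task definition."""
    w = word.lower()
    return (
        len(w) >= min_len       # minimum four letters
        and set(w) <= letters   # only available letters; repetition allowed
        and required in w       # mandatory letter present
    )

def score(generated, solutions):
    """Precision, recall, and F1 of generated words against a solution set."""
    g, s = set(generated), set(solutions)
    hits = len(g & s)
    precision = hits / len(g) if g else 0.0
    recall = hits / len(s) if s else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Letter set from the paper's introductory example; W is mandatory.
letters = set("agilnow")
print(is_valid("wagon", letters, "w"))  # satisfies all three constraints
print(is_valid("along", letters, "w"))  # fails: no mandatory W

# Toy solution set for illustration only (an assumption, not puzzle data).
toy_solutions = ["wagon", "wing", "lowing", "wallowing"]
print(score(["wagon", "along", "wing"], toy_solutions))
```

Using sets for both the generated and solution words means duplicate outputs are counted once, which is one reasonable reading of $|G_i|$; the paper excerpt does not say how duplicates are handled.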