Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs
Pith reviewed 2026-05-18 08:37 UTC · model grok-4.3
The pith
A modular system for creating constructed languages shows LLMs handle common grammatical patterns far more easily than rare ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present IASC, an interactive agentic system for ConLangs that creates phonology, morphology and syntax, lexicon, orthography, and a grammatical handbook through module-specific prompts, with refinement allowed by automatically generated commentary on previous outputs. Focusing on the morphosyntax module, the experiments demonstrate a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more typologically common patterns than rarer ones.
What carries the argument
The IASC modular prompt framework, which generates and refines ConLang components module by module using automatic commentary to isolate and test metalinguistic grammatical knowledge.
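The generate-then-refine pattern described above can be sketched in a few lines. This is a minimal illustration, not the released IASC code: the `LLM` callable, the `run_module` and `build_conlang` helpers, the prompt strings, the `NO ISSUES` stop marker, and the two-round cap are all assumptions made for the example. Only the module list follows the paper's abstract.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to a model reply.
LLM = Callable[[str], str]

# Components named in the abstract, in the order they are listed there.
MODULES = ["phonology", "morphosyntax", "lexicon", "orthography", "handbook"]


def run_module(llm: LLM, module: str, spec: str, context: str,
               max_rounds: int = 2) -> str:
    """Generate one component, then refine it using model-written commentary."""
    output = llm(f"[{module}] Specification: {spec}\nContext so far:\n{context}")
    for _ in range(max_rounds):
        commentary = llm(
            f"[{module}] Critique this output against the specification "
            f"'{spec}':\n{output}"
        )
        # Assumed convention: the critic replies NO ISSUES when satisfied.
        if "NO ISSUES" in commentary:
            break
        output = llm(
            f"[{module}] Revise the output.\nCommentary:\n{commentary}\n"
            f"Previous output:\n{output}"
        )
    return output


def build_conlang(llm: LLM, specs: Dict[str, str]) -> Dict[str, str]:
    """Run every module in order, feeding earlier components forward as context."""
    components: Dict[str, str] = {}
    context = ""
    for module in MODULES:
        components[module] = run_module(llm, module, specs.get(module, ""), context)
        context += f"\n## {module}\n{components[module]}"
    return components
```

Passing the model as a plain callable keeps the loop independent of any particular provider, so the same specification can be run against several LLMs and the outputs compared.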
If this is right
- Different LLMs display markedly different success rates when generating consistent morphosyntactic rules for a new language.
- LLMs succeed more readily when the required patterns match those frequent across natural languages than when the patterns are typologically uncommon.
- The overall system supplies a practical, interactive tool that lets users build complete constructed languages through successive refinements.
- This method offers a route to probe LLMs' grasp of general linguistic concepts without relying on facts from any specific existing language.
Where Pith is reading between the lines
- If the observed gaps hold up, they suggest that models' apparent linguistic knowledge largely tracks statistical regularities in training data rather than abstract rule systems.
- The same modular approach could be adapted to test other layers of language understanding such as semantic role assignment or discourse coherence.
- Model developers could incorporate rare-pattern ConLang tasks as a diagnostic for identifying blind spots in grammatical generalization.
Load-bearing premise
The modular prompt-based generation plus automatic commentary must accurately isolate metalinguistic grammatical knowledge, rather than test the model's ability to follow complex instructions or to continue patterns seen in training data.
What would settle it
Re-run the morphosyntax generation tasks on the same models and linguistic specifications, but without any module prompts, commentary steps, or ConLang framing, and check whether the performance advantage for typologically common patterns over rare ones disappears.
Original abstract
We present a system that uses LLMs as a tool in the development of Constructed Languages -- ConLangs, which we call IASC (Interactive Agentic System for ConLangs). The system is modular in that it creates each of the components -- phonology, morphology and syntax, lexicon, orthography, and grammatical handbook, using module-specific sets of prompts. The approach is agentic in that various modules allow for refining the output given automatically-generated commentary on a previous step. Our main goals are twofold. First, we aim to provide tools that facilitate an engaging and enjoyable experience in creating artificially constructed languages. Second, the focus of this paper is on using our ConLang framework as a novel way to explore what LLMs 'know' about language -- not what they know about any particular language or encyclopedic facts, but how much they know about and understand language and linguistic concepts. In the experiments, we particularly focus on the morphosyntax module and show that there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more typologically common patterns than rarer ones. All code is released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IASC, a modular agentic LLM-based system for generating constructed languages (ConLangs) by separately prompting for phonology, morphology/syntax, lexicon, orthography, and a grammatical handbook, with iterative refinement driven by automatically generated commentary on prior outputs. The central empirical focus is the morphosyntax module, where experiments across multiple LLMs reveal performance differences both between models and between typologically common versus rare linguistic patterns, which the authors interpret as evidence of varying metalinguistic grammatical knowledge.
Significance. If the interpretation holds after addressing confounds, the work supplies a practical tool for ConLang creation while offering a novel probe for abstract linguistic knowledge in LLMs that goes beyond language-specific facts. The open release of code is a clear strength that enables reproducibility and extension. The significance is tempered by the need to demonstrate that observed gaps reflect conceptual understanding rather than instruction-following or pretraining biases.
major comments (3)
- [Experiments] Morphosyntax experiments: The central claim that performance differences reflect metalinguistic grammatical knowledge depends on the modular prompt system isolating linguistic concepts, yet no control conditions (e.g., non-linguistic rule-generation tasks or novel feature combinations) are reported to rule out confounds from prompt complexity, multi-step instruction following, or differential exposure to common vs. rare structures in pretraining data.
- [Experiments] Evaluation of results: The reported 'wide gulf' across models and pattern types is presented qualitatively without quantitative metrics, error bars, number of trials, or statistical tests, making it difficult to assess robustness or to compare effect sizes between typologically common and rare specifications.
- [IASC system description] Agentic refinement: The automatic commentary module that drives refinement is itself LLM-generated and not validated against human linguistic judgments or inter-annotator agreement, raising the possibility that observed differences partly reflect commentary quality or error accumulation rather than the target model's grammatical knowledge.
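One way to quantify the human validation raised in the third comment is a chance-corrected agreement statistic between the commentary module's verdicts and a human linguist's verdicts on the same outputs. The sketch below computes Cohen's kappa. The function name, the label values, and the idea of reducing each commentary to a single verdict label are assumptions made for illustration, not part of the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts, invented for the example.
llm_verdicts = ["ok", "ok", "bad", "bad"]
human_verdicts = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(llm_verdicts, human_verdicts)
```

A kappa near 1 would suggest the automatic commentary tracks human judgment closely, while a value near 0 would mean agreement is no better than chance.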
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly separate the tool-building contribution from the probing contribution to clarify the paper's primary focus for readers.
- Notation for linguistic features and pattern types should be standardized and defined in a single table or section to improve readability when comparing common vs. rare specifications.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We value the opportunity to address the concerns raised regarding experimental controls, quantitative evaluation, and validation of the agentic components. We respond to each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: [Experiments] Morphosyntax experiments: The central claim that performance differences reflect metalinguistic grammatical knowledge depends on the modular prompt system isolating linguistic concepts, yet no control conditions (e.g., non-linguistic rule-generation tasks or novel feature combinations) are reported to rule out confounds from prompt complexity, multi-step instruction following, or differential exposure to common vs. rare structures in pretraining data.
  Authors: We agree that additional controls would help isolate whether differences stem from metalinguistic knowledge rather than other factors. Our design varies only the linguistic specifications (common vs. rare patterns) while holding the modular prompt structure constant across conditions, which provides some control for prompt complexity and instruction following. Nevertheless, we acknowledge the value of explicit controls such as non-linguistic rule tasks. In the revised manuscript we will add a dedicated limitations subsection discussing these potential confounds and include one simple control experiment comparing linguistic versus arbitrary symbolic rule generation.
  Revision: partial
- Referee: [Experiments] Evaluation of results: The reported 'wide gulf' across models and pattern types is presented qualitatively without quantitative metrics, error bars, number of trials, or statistical tests, making it difficult to assess robustness or to compare effect sizes between typologically common and rare specifications.
  Authors: We accept this criticism. The current draft relies on qualitative descriptions of observed differences. For the revision we will report the exact number of trials per condition, success rates as percentages with standard error bars, and apply appropriate statistical tests (e.g., paired t-tests or Fisher's exact tests) to quantify differences both across models and between common versus rare linguistic patterns. These metrics will be added to the results section and figures.
  Revision: yes
- Referee: [IASC system description] Agentic refinement: The automatic commentary module that drives refinement is itself LLM-generated and not validated against human linguistic judgments or inter-annotator agreement, raising the possibility that observed differences partly reflect commentary quality or error accumulation rather than the target model's grammatical knowledge.
  Authors: This is a fair observation. The commentary module is intentionally LLM-generated to enable fully automated iterative refinement. We will revise the system description to explicitly state this design choice and its scalability benefits while adding a limitations paragraph noting the absence of human validation. We will also outline future work that could include human annotation studies to measure commentary quality and inter-annotator agreement.
  Revision: partial
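The quantitative reporting promised in the second response (trial counts, success rates with standard errors, and a significance test) could look roughly like the sketch below. It uses Fisher's exact test, one of the two tests the authors name, implemented directly from the hypergeometric distribution so that it needs only the standard library. The function names and all counts are hypothetical and are not results from the paper.

```python
from math import comb, sqrt
from typing import Tuple


def success_rate(successes: int, trials: int) -> Tuple[float, float]:
    """Return (proportion, standard error of the proportion)."""
    p = successes / trials
    return p, sqrt(p * (1 - p) / trials)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions (e.g. common vs. rare pattern); columns are
    (success, failure) counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 8/10 successes on a common pattern, 1/6 on a rare one.
common_rate, common_se = success_rate(8, 10)
rare_rate, rare_se = success_rate(1, 6)
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

Fisher's exact test suits the small per-condition trial counts that LLM generation experiments often have. A paired t-test would need matched numeric scores for each model or specification instead of success counts.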
Circularity Check
No circularity: empirical system and experimental results are self-contained
The paper introduces an agentic modular LLM system (IASC) for generating ConLang components via prompt-based modules and automatic refinement, then reports experimental performance gaps on morphosyntax tasks across LLMs and linguistic specifications. These gaps are measured directly from system outputs on typologically common vs. rare patterns and do not reduce to any fitted parameters, self-defined quantities, or self-citation chains. No equations, uniqueness theorems, or ansatzes are present that loop back to the authors' inputs or prior work. The claims rest on observable experimental results with released code, making the work externally falsifiable and independent of its own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM outputs under modular prompts plus automatic commentary reflect metalinguistic grammatical knowledge rather than instruction-following or training-data continuation.
invented entities (1)
- IASC (Interactive Agentic System for ConLangs): no independent evidence