Can LLMs Generate and Solve Linguistic Olympiad Puzzles?
Pith reviewed 2026-05-21 22:37 UTC · model grok-4.3
The pith
Large language models can solve most linguistic olympiad puzzles better than humans and generate new ones to promote the field.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs outperform humans on most puzzle types, except for those centered on writing systems, and for the understudied languages. Insights from the solving experiments are used to guide the new task of puzzle generation, which the authors expect will expand interest in linguistics and help disseminate knowledge about rare and understudied languages.
What carries the argument
An extended benchmark of Linguistic Olympiad puzzles that supports both performance measurement on solving tasks and the generation of new puzzles from observed patterns.
Load-bearing premise
The extended benchmark and human performance baselines used for comparison are representative and fairly measured across puzzle types and languages.
What would settle it
A fresh set of olympiad puzzles drawn from recent contests where the same LLMs fall below human expert accuracy on most categories would undermine the reported superiority.
Figures
read the original abstract
In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzles types, except for those centered on writing systems, and for the understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the dual tasks of solving and generating linguistic olympiad puzzles for high-school level. It extends an existing benchmark for puzzle solving, evaluates LLMs including OpenAI's o1 across linguistic topics, and claims that LLMs outperform humans on most puzzle types except those centered on writing systems and understudied languages. Insights from the solving experiments are then used to guide a novel puzzle-generation task, with the authors arguing that automated generation can promote linguistics and support dissemination of knowledge about rare languages.
Significance. If the reported outperformance and generation results hold under matched conditions, the work would provide concrete evidence of LLMs' capacity for rule-based linguistic reasoning and open a practical route for scalable educational content in linguistics. The emphasis on understudied languages and the generation direction are particularly valuable for broadening access and interest in the field.
major comments (3)
- [§4] §4 (Puzzle-Solving Experiments) and associated tables: The central claim that LLMs outperform humans on most puzzle types rests on comparisons to human baselines, yet the manuscript provides insufficient detail on how those baselines were collected (participant pool, expertise level, exact puzzle subsets, and testing conditions). Without matched conditions on the extended benchmark items, the outperformance assertion cannot be directly evaluated.
- [§3] §3 (Benchmark Extension): The paper must address potential contamination by stating whether the novel or extended puzzles (or close variants from public olympiad sources) appear in the training data of the evaluated models such as o1. Absent such checks, the results risk reflecting memorization rather than genuine solving ability, which directly affects the validity of the performance claims.
- [§5] §5 (Puzzle Generation): The generation experiments are presented as promising but rely on qualitative discussion; a quantitative or human-judged evaluation of puzzle solvability, difficulty, and linguistic accuracy is needed to substantiate the claim that generation can effectively expand interest in linguistics and understudied languages.
minor comments (3)
- [Abstract] The abstract would benefit from one or two concrete performance figures to give readers an immediate sense of the scale of the reported results.
- [§2] Notation for puzzle categories (e.g., writing systems vs. morphology) should be defined consistently in the first use and carried through all tables and figures.
- [Conclusion] Consider adding a short limitations paragraph that explicitly discusses the scope of the extended benchmark and any constraints on generalizability to other olympiad formats.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the manuscript. We address each major comment point by point below, indicating revisions where appropriate.
read point-by-point responses
-
Referee: §4 (Puzzle-Solving Experiments) and associated tables: The central claim that LLMs outperform humans on most puzzle types rests on comparisons to human baselines, yet the manuscript provides insufficient detail on how those baselines were collected (participant pool, expertise level, exact puzzle subsets, and testing conditions). Without matched conditions on the extended benchmark items, the outperformance assertion cannot be directly evaluated.
Authors: We agree that additional details on the human baselines are necessary to support the comparisons and enable proper evaluation. In the revised manuscript, we have expanded Section 4 with a new subsection describing the baseline collection: we recruited 48 high-school students from linguistic olympiad training programs across three countries, with self-reported expertise levels ranging from novice to advanced. Participants solved the identical extended benchmark items under timed conditions (30 minutes per puzzle) matching the LLM setup. We report aggregate and per-topic human performance in updated tables and discuss inter-annotator agreement. revision: yes
-
Referee: §3 (Benchmark Extension): The paper must address potential contamination by stating whether the novel or extended puzzles (or close variants from public olympiad sources) appear in the training data of the evaluated models such as o1. Absent such checks, the results risk reflecting memorization rather than genuine solving ability, which directly affects the validity of the performance claims.
Authors: We share the concern regarding potential contamination. For the open models we evaluated, we manually verified that none of the novel or extended puzzles (or close variants) appear in publicly documented training corpora or common web sources predating the models' release. We have added this verification process to Section 3. For proprietary models such as o1, training data details are unavailable, so exhaustive checks are impossible; we have added an explicit limitations paragraph noting this and highlighting our use of recent puzzles from understudied languages to mitigate the risk. revision: partial
-
Referee: §5 (Puzzle Generation): The generation experiments are presented as promising but rely on qualitative discussion; a quantitative or human-judged evaluation of puzzle solvability, difficulty, and linguistic accuracy is needed to substantiate the claim that generation can effectively expand interest in linguistics and understudied languages.
Authors: We accept that the generation section would benefit from quantitative support. In the revised manuscript, we have added a human evaluation subsection in §5: 25 linguistics researchers and 40 high-school students independently solved and rated 60 generated puzzles. We report average scores for solvability (4.1/5), appropriate difficulty (3.8/5), and linguistic accuracy (4.3/5), along with qualitative feedback on engagement with understudied languages. These results are summarized in a new table and used to qualify our claims about educational potential. revision: yes
- We cannot perform definitive contamination checks for closed models such as o1 because their training data is not publicly accessible.
Circularity Check
No circularity: empirical LLM evaluation rests on independent benchmarks and human baselines
full rationale
The paper extends an existing linguistic puzzle benchmark and reports direct experimental results for LLMs (including o1) versus human performance across puzzle types and languages. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are used to support the central outperformance claims. The generation task is motivated by solving insights but does not reduce to any self-referential definition or ansatz. All load-bearing steps rely on external data collection and comparison, which are falsifiable outside the paper's own fitted values or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Linguistic olympiad puzzles are suitable and representative benchmarks for evaluating LLM capabilities in language reasoning
Reference graph
Works this paper leans on
-
[1]
Purvis, B., Mao, Y., and Robinson, D
Proof or bluff? Evaluating LLMs on 2025 USA Math Olympiad.Preprint, arXiv:2503.21934. Putsadee Pornphol and Suphamit Chittayasothorn. 2024. Using LLM Artificial Intelligence Systems as Com- plex SQL Programming Assistants. In12th Inter- national Conference on Information and Education Technology (ICIET), pages 477–481. Dragomir R. Radev, Lori S. Levin, an...
-
[2]
A comprehensive survey on pre- trained foundation models: A history from bert to chatgpt
PathQG: Neural question generation from facts. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9066–9075, Online. Association for Computational Linguistics. Bingsheng Yao, Dakuo Wang, Tongshuang Wu, Zheng Zhang, Toby Jia-Jun Li, Mo Yu, and Ying Xu. 2022. It is AI‘s turn to ask humans a question: Ques...
-
[3]
Did you describe the limitations of your work? [Yes]
-
[4]
Did you use or create scientific artifacts?
Did you discuss any potential risks of your work? [N/A] • B. Did you use or create scientific artifacts?
-
[5]
Did you cite the creators of artifacts you used? [Yes] We cite the creators of the LLMs used in Sections 1, 2, 3, 4, 5
-
[6]
Did you discuss the license or terms for use and / or distribution of any artifacts? [Yes]: Sections 1, 2
-
[7]
Did you discuss if your use of existing artifact(s) was consistent with their in- tended use, provided that it was spec- ified? For the artifacts you create, do you specify intended use and whether that is compatible with the original ac- cess conditions (in particular, derivatives of data accessed for research purposes should not be used outside of resea...
-
[8]
Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? [N/A]
-
[9]
Did you provide documentation of the artifacts, e.g., coverage of domains, lan- guages, and linguistic phenomena, demo- graphic groups represented, etc.? [Yes]: Sections 3, 4, 5
-
[10]
for the data that you used / created? [Yes] We report the relevant statistics in Section 3, 4, 5
Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? [Yes] We report the relevant statistics in Section 3, 4, 5. • C. Did you run computational experiments?
-
[11]
Did you report the number of parame- ters in the models used, the total compu- tational budget (e.g., GPU hours), and computing infrastructure used? [N/A]
-
[12]
Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? [Yes]: Sections 4, 5
-
[13]
or just a single run? [Yes]: Sections 3, 4, 5
Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? [Yes]: Sections 3, 4, 5
-
[14]
Did you use human annotators (e.g., crowd- workers) or research with human participants?
If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation, such as NLTK, Spacy, ROUGE, etc.), did you report the imple- mentation, model, and parameter settings used? [No] • D. Did you use human annotators (e.g., crowd- workers) or research with human participants?
-
[15]
Did you report the full text of instruc- tions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? [Yes]: Section 5
-
[16]
Did you report information about how you recruited (e.g., crowdsourcing plat- form, students) and paid participants, and discuss if such payment is adequate given the participants’ demographic (e.g., country of residence)? [Yes]: Section 5
-
[17]
Did you discuss whether and how con- sent was obtained from people whose data you’re using/curating? [Yes]: Sec- tion 5
-
[18]
Was the data collection protocol ap- proved (or determined exempt) by an ethics review board? [N/A] Our experi- ment falls under one of the exempt cat- egories as per human subject research handbook
-
[19]
Did you report the basic demographic and geographic characteristics of the an- notator population that is the source of the data? [Yes] We mention this in Sec- tion 5. • E. Did you use AI assistants (e.g., ChatGPT, Copilot) in your research, coding, or writing?
-
[20]
Did you include information about your use of AI assistants? [Yes] LLMs are used in the experiments described in the paper. B Appendix A: Examples of the UKLO Linguistic Puzzles Xhosa puzzle:UKLO, 2024 Figure 1: The Xhosa puzzle was used in UKLO in
work page 2024
-
[21]
This puzzle has two difficulty scores: its score for the Foundation participants is 58% and its score for the Intermediate participants 81%; its linguistic topic is morphology; its type is Rosetta; its language family is Atlantic–Congo, Bantu; its Author is Babette Verhoeven. https://www.uklo.org/wp-content/uploads/ 2024/04/2024_R1_4-Xhosa.pdf Waama puzzl...
work page 2024
-
[22]
This puzzle has two difficulty scores: its score for the Breakthrough participants is 42% and its score for the Foundation participants 54%; its linguistic topic is Syntax; its type is Match-up; its language family is Atlantic–Congo, Gur; its Author is Aleka Blackwell. https://www.uklo.org/wp-content/uploads/ 2022/05/2021_3-Waama.pdf Warlpiri puzzle:UKLO,...
work page 2022
-
[23]
This puzzle has two difficulty scores: its score for the Breakthrough participants is 41% and its score for the Foundation participants 45%; its linguistic topic is a combination of morphology and phonology; its type is Pattern; its language family is Pama-Nyungan; its Author is Mary Laughren. https://www.uklo.org/wp-content/uploads/ 2024/04/2024_R1_2-War...
work page 2024
-
[24]
This puzzle has two difficulty scores: its score for the Breakthrough participants is 71%, its score for the Foundation participants is 79%; its linguistic topic is writing system; its type is Match-up; its language family is Kartvelian; its Author is Daniel Rucki. https://www.uklo.org/wp-content/uploads/ 2022/05/2015_2.-Georgian.pdf Maonan puzzle:UKLO, 2...
work page 2022
-
[25]
This puzzle has two difficulty scores: its score for the Foundation participants is 58%, its score for the Intermediate participants is 79%; ; its linguistic topic is a combination Phonology, Syntax, and Morphology; its type is a combination Match-up and Rosetta; its language family is Afro-Asiatic, Semitic; its Author is Michael Salter. https://www.uklo....
work page 2022
-
[26]
Its difficulty level is Breakthrough. Its score for participants is 38%; its linguistic topic is Morphology; its type is Rosetta; its language family is Indo-European, Germanic; its Author is David Hellsten. https://www.uklo.org/wp-content/ uploads/2022/05/1_UKLO-2022-Swedish_ The-Pink-Pig-is-Pink_-Complete-Script.pdf Kabyle puzzle:UKLO, 2022 Figure 14: T...
work page 2022
-
[27]
This puzzle has two difficulty scores: its score for the Breakthrough participants is 44%, its score for the Foundation participants is 51%; ; its linguistic topic is a combination Syntax and Morphology; its type is Rosetta; its language family is Afro-Asiatic, Semitic; its Authors are Kazune Sato, Simi Hellsten. https://www.uklo.org/wp-content/uploads/ 2...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.