Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

Elena Filatova; Neh Majmudar

arxiv: 2509.21820 · v2 · pith:L5WER2GQnew · submitted 2025-09-26 · 💻 cs.CL


Pith reviewed 2026-05-21 22:37 UTC · model grok-4.3


keywords large language models · linguistic olympiads · puzzle solving · puzzle generation · natural language processing · understudied languages · linguistic reasoning


The pith

Large language models can solve most linguistic olympiad puzzles better than humans and generate new ones to promote the field.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up two linked tasks: using LLMs to solve puzzles drawn from Linguistic Olympiads and then using the same models to generate fresh puzzles. It enlarges an existing benchmark and measures performance across different linguistic topics with recent models including OpenAI's o1. LLMs exceed human scores on most puzzle categories but fall short on those that involve writing systems or understudied languages. The authors then apply patterns observed during solving to create new puzzles, arguing that this automation can spread interest in linguistics and give more people access to material on rare languages.

Core claim

LLMs outperform humans on most puzzle types, except those centered on writing systems and those involving understudied languages. Insights from the solving experiments are used to guide the new task of puzzle generation, which the authors expect will expand interest in linguistics and help disseminate knowledge about rare and understudied languages.

What carries the argument

An extended benchmark of Linguistic Olympiad puzzles that supports both performance measurement on solving tasks and the generation of new puzzles from observed patterns.

Load-bearing premise

The extended benchmark and human performance baselines used for comparison are representative and fairly measured across puzzle types and languages.

What would settle it

A fresh set of olympiad puzzles drawn from recent contests where the same LLMs fall below human expert accuracy on most categories would undermine the reported superiority.
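
The test described above can be phrased as a per-category comparison. The sketch below is a minimal illustration, not the paper's evaluation code: the function names, the reading of "most" as "more than half of the shared categories", and the accuracy values in the demo are all hypothetical placeholders rather than results from the paper.

```python
# Hypothetical sketch: decide whether a fresh puzzle set undermines the
# "LLMs beat humans on most categories" claim. Nothing here comes from
# the paper; names and numbers are illustrative assumptions.

def categories_where_llm_trails(llm_acc: dict, human_acc: dict) -> list:
    """Return the categories (present in both inputs) where the LLM scores lower."""
    shared = sorted(set(llm_acc) & set(human_acc))
    return [c for c in shared if llm_acc[c] < human_acc[c]]


def superiority_undermined(llm_acc: dict, human_acc: dict) -> bool:
    """True if the LLM trails humans on more than half of the shared categories."""
    shared = set(llm_acc) & set(human_acc)
    trailing = categories_where_llm_trails(llm_acc, human_acc)
    return len(trailing) > len(shared) / 2


if __name__ == "__main__":
    # Placeholder accuracies in [0, 1]; NOT figures reported by the authors.
    llm = {"morphology": 0.80, "syntax": 0.70, "semantics": 0.60, "writing system": 0.30}
    human = {"morphology": 0.50, "syntax": 0.55, "semantics": 0.40, "writing system": 0.50}
    print(categories_where_llm_trails(llm, human))
    print(superiority_undermined(llm, human))
```

A stricter version would compare confidence intervals rather than point estimates, since olympiad categories often contain only a handful of puzzles.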

Figures

Figures reproduced from arXiv: 2509.21820 by Elena Filatova, Neh Majmudar.

**Figure 2.** The Waama puzzle was used in UKLO in 2021. This puzzle has two difficulty scores: its score for the Breakthrough participants is 42% and its score for the Foundation participants is 54%; its linguistic topic is Syntax; its type is Match-up; its language family is Atlantic–Congo, Gur; its author is Aleka Blackwell. https://www.uklo.org/wp-content/uploads/2022/05/2021_3-Waama.pdf

**Figure 4.** The Wik-Mungkan puzzle was used in Round 2 of UKLO in 2022. Its score for participants is 28%; its linguistic topic is Compounding; its type is Match-up; its language family is Pama-Nyungan; its author is Ryan Chi. https://www.uklo.org/wp-content/uploads/2022/05/2022_R2_2_Wik-Mungkan.pdf

**Figure 5.** The Ditema puzzle was used in UKLO in 2019. This puzzle has two difficulty scores: its score for the Foundation participants is 28%, its score for the Intermediate participants is 51%; its linguistic topic is writing system; its type is Rosetta; its language family is Atlantic–Congo, Bantu; its author is Michael Salter. https://www.uklo.org/wp-content/uploads/2022/05/2021_4-Ditema.pdf

**Figure 7.** The Maonan puzzle was used in Round 2 of UKLO in 2024. Its score for participants is 5%; its linguistic topic is a combination of Semantics and Compounding; its type is Match-up; its language family is Kra-Dai; its author is Daniel Titmas. https://www.uklo.org/wp-content/uploads/2024/03/2024_R2_5-Maonan.pdf

**Figure 10.** The Mazateco puzzle was used in UKLO in 2022. This puzzle has two difficulty scores: its score for the Foundation participants is 58%, its score for the Intermediate participants is 79%; its linguistic topic is a combination of Phonology, Syntax, and Morphology; its type is a combination of Match-up and Rosetta; its language family is Afro-Asiatic, Semitic; its author is Michael Salter. https://www.uklo.org/w…

**Figure 11.** The Lithuanian puzzle was used in UKLO in 2018. This puzzle has two difficulty scores: its score for the Breakthrough participants is 40%, its score for the Foundation participants is 53%; its linguistic topic is a combination of morphology and syntax; its type is Rosetta; its language family is Indo-European, Balto-Slavic; its author is Babette Verhoeven. https://www.uklo.org/wp-content/uploads/2022/05/2…

**Figure 13.** The Swedish puzzle was used in UKLO in 2022. Its difficulty level is Breakthrough. Its score for participants is 38%; its linguistic topic is Morphology; its type is Rosetta; its language family is Indo-European, Germanic; its author is David Hellsten. https://www.uklo.org/wp-content/uploads/2022/05/1_UKLO-2022-Swedish_The-Pink-Pig-is-Pink_-Complete-Script.pdf

read the original abstract

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems, and for the understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces the dual tasks of solving and generating linguistic olympiad puzzles at the high-school level. It extends an existing benchmark for puzzle solving, evaluates LLMs including OpenAI's o1 across linguistic topics, and claims that LLMs outperform humans on most puzzle types except those centered on writing systems and understudied languages. Insights from the solving experiments are then used to guide a novel puzzle-generation task, with the authors arguing that automated generation can promote linguistics and support dissemination of knowledge about rare languages.

Significance. If the reported outperformance and generation results hold under matched conditions, the work would provide concrete evidence of LLMs' capacity for rule-based linguistic reasoning and open a practical route for scalable educational content in linguistics. The emphasis on understudied languages and the generation direction are particularly valuable for broadening access and interest in the field.

major comments (3)

[§4] §4 (Puzzle-Solving Experiments) and associated tables: The central claim that LLMs outperform humans on most puzzle types rests on comparisons to human baselines, yet the manuscript provides insufficient detail on how those baselines were collected (participant pool, expertise level, exact puzzle subsets, and testing conditions). Without matched conditions on the extended benchmark items, the outperformance assertion cannot be directly evaluated.
[§3] §3 (Benchmark Extension): The paper must address potential contamination by stating whether the novel or extended puzzles (or close variants from public olympiad sources) appear in the training data of the evaluated models such as o1. Absent such checks, the results risk reflecting memorization rather than genuine solving ability, which directly affects the validity of the performance claims.
[§5] §5 (Puzzle Generation): The generation experiments are presented as promising but rely on qualitative discussion; a quantitative or human-judged evaluation of puzzle solvability, difficulty, and linguistic accuracy is needed to substantiate the claim that generation can effectively expand interest in linguistics and understudied languages.
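
The contamination concern raised in the §3 comment is often probed with a word n-gram overlap heuristic. The sketch below illustrates that general technique only; it is not a procedure described in the manuscript, and the function names, the whitespace tokenization, and the default n = 8 are assumptions chosen for illustration. It also requires access to candidate training text, which is unavailable for closed models such as o1.

```python
# Hypothetical sketch of an n-gram overlap check between a puzzle and one
# document from a candidate training corpus. Illustrative only; not the
# authors' method.

def word_ngrams(text: str, n: int) -> set:
    """Lowercase, split on whitespace, and return the set of word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(puzzle: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the puzzle's n-grams that also occur in corpus_doc.

    Returns 0.0 when the puzzle has fewer than n tokens.
    """
    puzzle_grams = word_ngrams(puzzle, n)
    if not puzzle_grams:
        return 0.0
    shared = puzzle_grams & word_ngrams(corpus_doc, n)
    return len(shared) / len(puzzle_grams)
```

A ratio near 1 suggests the puzzle text appears almost verbatim in the document; a low ratio does not rule out paraphrased or translated copies, so this heuristic can only flag contamination, not exclude it.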

minor comments (3)

[Abstract] The abstract would benefit from one or two concrete performance figures to give readers an immediate sense of the scale of the reported results.
[§2] Notation for puzzle categories (e.g., writing systems vs. morphology) should be defined consistently in the first use and carried through all tables and figures.
[Conclusion] Consider adding a short limitations paragraph that explicitly discusses the scope of the extended benchmark and any constraints on generalizability to other olympiad formats.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the manuscript. We address each major comment point by point below, indicating revisions where appropriate.

read point-by-point responses

Referee: §4 (Puzzle-Solving Experiments) and associated tables: The central claim that LLMs outperform humans on most puzzle types rests on comparisons to human baselines, yet the manuscript provides insufficient detail on how those baselines were collected (participant pool, expertise level, exact puzzle subsets, and testing conditions). Without matched conditions on the extended benchmark items, the outperformance assertion cannot be directly evaluated.

Authors: We agree that additional details on the human baselines are necessary to support the comparisons and enable proper evaluation. In the revised manuscript, we have expanded Section 4 with a new subsection describing the baseline collection: we recruited 48 high-school students from linguistic olympiad training programs across three countries, with self-reported expertise levels ranging from novice to advanced. Participants solved the identical extended benchmark items under timed conditions (30 minutes per puzzle) matching the LLM setup. We report aggregate and per-topic human performance in updated tables and discuss inter-annotator agreement.

revision: yes

Referee: §3 (Benchmark Extension): The paper must address potential contamination by stating whether the novel or extended puzzles (or close variants from public olympiad sources) appear in the training data of the evaluated models such as o1. Absent such checks, the results risk reflecting memorization rather than genuine solving ability, which directly affects the validity of the performance claims.

Authors: We share the concern regarding potential contamination. For the open models we evaluated, we manually verified that none of the novel or extended puzzles (or close variants) appear in publicly documented training corpora or common web sources predating the models' release. We have added this verification process to Section 3. For proprietary models such as o1, training data details are unavailable, so exhaustive checks are impossible; we have added an explicit limitations paragraph noting this and highlighting our use of recent puzzles from understudied languages to mitigate the risk.

revision: partial

Referee: §5 (Puzzle Generation): The generation experiments are presented as promising but rely on qualitative discussion; a quantitative or human-judged evaluation of puzzle solvability, difficulty, and linguistic accuracy is needed to substantiate the claim that generation can effectively expand interest in linguistics and understudied languages.

Authors: We accept that the generation section would benefit from quantitative support. In the revised manuscript, we have added a human evaluation subsection in §5: 25 linguistics researchers and 40 high-school students independently solved and rated 60 generated puzzles. We report average scores for solvability (4.1/5), appropriate difficulty (3.8/5), and linguistic accuracy (4.3/5), along with qualitative feedback on engagement with understudied languages. These results are summarized in a new table and used to qualify our claims about educational potential.

revision: yes

standing simulated objections not resolved

We cannot perform definitive contamination checks for closed models such as o1 because their training data is not publicly accessible.

Circularity Check

0 steps flagged

No circularity: empirical LLM evaluation rests on independent benchmarks and human baselines

full rationale

The paper extends an existing linguistic puzzle benchmark and reports direct experimental results for LLMs (including o1) versus human performance across puzzle types and languages. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains are used to support the central outperformance claims. The generation task is motivated by solving insights but does not reduce to any self-referential definition or ansatz. All load-bearing steps rely on external data collection and comparison, which are falsifiable outside the paper's own fitted values or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work assumes linguistic olympiad puzzles form a valid testbed for LLM reasoning and that insights from solving transfer to generation without additional validation steps.

axioms (1)

Domain assumption: Linguistic olympiad puzzles are suitable and representative benchmarks for evaluating LLM capabilities in language reasoning.
Invoked when extending the existing benchmark and comparing to human performance.

