From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Anne Huang; Elena Filatova; Jinfan Frank Hu; Neh Majmudar

arxiv: 2605.13408 · v1 · pith:EOXZTLKFnew · submitted 2026-05-13 · 💻 cs.CL

From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Neh Majmudar , Anne Huang , Jinfan Frank Hu , Elena Filatova This is my paper

Pith reviewed 2026-05-14 19:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords linguistic puzzlesRosetta StoneMatch-Uppaired corpusLLM benchmarkshuman solverspuzzle conversionlinguistic reasoning

0 comments

The pith

A conversion procedure turns Rosetta Stone puzzles into Match-Up versions that both expert humans and LLMs either solve completely or fail entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a systematic way to convert existing Rosetta Stone linguistic puzzles into corresponding Match-Up puzzles to speed up creation of new examples. This paired dataset lets researchers compare how the two formats test the same underlying linguistic reasoning. Tests with expert human solvers and large language models on the converted puzzles show an all-or-nothing pattern where participants succeed fully or not at all. The work supplies a new resource for studying puzzle difficulty and offers benchmarks for both human and machine performance on these tasks.

Core claim

The authors develop a systematic conversion procedure that produces Match-Up counterparts from Rosetta Stone puzzles. When the resulting pairs are solved by expert humans and LLMs, both groups display an all-or-nothing pattern on the Match-Up versions, completing them entirely or failing completely.

What carries the argument

The systematic conversion procedure that transforms Rosetta Stone puzzles into Match-Up counterparts while aiming to preserve linguistic reasoning demands.

Load-bearing premise

The conversion procedure produces Match-Up puzzles that preserve the original linguistic reasoning demands and difficulty level of the Rosetta Stone versions.

What would settle it

A converted Match-Up puzzle on which expert solvers produce partial but incomplete correct answers would challenge the all-or-nothing pattern.

read the original abstract

In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The conversion method for pairing Rosetta Stone and Match-Up puzzles is the actual new piece, but the all-or-nothing result lacks the checks needed to confirm it reflects format differences rather than conversion artifacts.

read the letter

The paper's useful contribution is a systematic conversion procedure that turns existing Rosetta Stone puzzles into Match-Up versions, producing a paired corpus for high-school level linguistic competitions. This addresses the real bottleneck of puzzle creation time, and the resulting dataset could be reused for education or narrow benchmarking tasks. They then run human experts and LLMs on the pairs and report that Match-Up puzzles produce an all-or-nothing outcome: solvers either complete them fully or fail outright. That pattern is at least worth noting for anyone building similar test sets. The conversion approach itself is described at a level that seems reproducible enough to try on new puzzles. The main limitation is that the all-or-nothing claim rests on an untested assumption. The abstract gives no sample sizes, no item-level comparison of clue structures or ambiguity before and after conversion, and no error analysis showing that difficulty stayed constant. Without those checks, the binary pattern could come from added constraints or simplifications in the conversion step rather than anything inherent to the Match-Up format. The paper does not appear to include statistical tests or controls for selection bias in the puzzle set. This work is aimed at researchers who build or evaluate puzzle-based benchmarks in linguistics and AI, or at educators who need more examples in these formats. A reader already working on constrained language tasks or dataset generation methods will find the paired data and procedure worth examining. It is not broad enough for general language understanding claims. I would send it to peer review. The dataset is new and the conversion idea has practical value, so referees can help tighten the validation and reporting without starting from scratch.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a systematic procedure to convert existing Rosetta Stone linguistic puzzles into paired Match-Up counterparts, generating a new dataset of such pairs. It then evaluates the pairs using expert human solvers and LLMs, reporting that both exhibit an all-or-nothing performance pattern on Match-Up puzzles (complete success or total failure). The work positions the conversion method as an efficient way to scale puzzle creation and uses the paired data to compare difficulty and reasoning demands across formats.

Significance. If the conversion is shown to preserve original reasoning demands, the paired corpus would be a useful resource for controlled comparisons of linguistic reasoning in humans versus LLMs. The reported all-or-nothing pattern, if robustly demonstrated, could highlight format-specific challenges and inform the construction of more diagnostic benchmarks for machine linguistic capabilities.

major comments (2)

[§3] §3 (Conversion Procedure): The central claim that the systematic conversion produces Match-Up puzzles with equivalent linguistic reasoning demands to the Rosetta Stone originals is not supported by any validation step, such as item-level matching of clue structures, ambiguity types, or pre/post-conversion solver error distributions. Without this, the observed all-or-nothing pattern on Match-Up puzzles could be an artifact of the conversion rather than an inherent format difference.
[Results] Results section (and abstract): The all-or-nothing pattern is asserted for both humans and LLMs without reporting sample sizes (number of participants, puzzles per condition, or LLM instances), statistical tests for binarity, or error analysis. This absence makes it impossible to evaluate robustness against selection bias or small-sample effects, directly affecting the load-bearing empirical claim.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief statement of the number of puzzle pairs generated and the exact human/LLM sample sizes to allow readers to gauge scale immediately.
[Figures/Tables] Figure captions and table headers should explicitly label whether performance metrics are per-puzzle or aggregated, and whether 'all-or-nothing' is defined by exact match or partial credit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback highlights important areas where additional detail will improve the clarity and robustness of our claims regarding the conversion procedure and the empirical results. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [§3] §3 (Conversion Procedure): The central claim that the systematic conversion produces Match-Up puzzles with equivalent linguistic reasoning demands to the Rosetta Stone originals is not supported by any validation step, such as item-level matching of clue structures, ambiguity types, or pre/post-conversion solver error distributions. Without this, the observed all-or-nothing pattern on Match-Up puzzles could be an artifact of the conversion rather than an inherent format difference.

Authors: We agree that the manuscript would benefit from explicit validation steps to support the equivalence of reasoning demands. The conversion procedure was constructed as a systematic, rule-based mapping that directly translates each linguistic clue, ambiguity, and dependency from the Rosetta Stone format into the corresponding Match-Up structure. In the revised manuscript we will add a dedicated subsection to §3 that includes item-level examples demonstrating preserved clue structures and ambiguity types across multiple puzzle pairs. We will also report a comparison of human solver error distributions on the original and converted versions for the evaluated pairs. These additions will provide direct evidence that the all-or-nothing pattern arises from format differences rather than conversion artifacts. revision: yes
Referee: [Results] Results section (and abstract): The all-or-nothing pattern is asserted for both humans and LLMs without reporting sample sizes (number of participants, puzzles per condition, or LLM instances), statistical tests for binarity, or error analysis. This absence makes it impossible to evaluate robustness against selection bias or small-sample effects, directly affecting the load-bearing empirical claim.

Authors: We acknowledge that the current reporting lacks the necessary quantitative details for assessing robustness. The revised Results section and abstract will explicitly report the sample sizes used, including the number of expert human participants, the number of puzzles per condition, and the number of LLM instances (with model versions and run counts). We will add formal statistical tests for the binarity of performance distributions (e.g., a multimodality test) along with a categorized error analysis of failure modes for both humans and LLMs. These changes will allow readers to evaluate potential selection bias and small-sample effects directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and observed patterns

full rationale

The paper proposes a conversion procedure from Rosetta Stone to Match-Up puzzles and reports empirical results from human solvers and LLMs on the resulting pairs. No equations, fitted parameters, or derivations are present that reduce the reported all-or-nothing pattern to inputs by construction. The central observations arise directly from the new paired dataset and external benchmarks rather than self-definition, self-citation chains, or renaming of prior results. The conversion step is an input-generation method whose validity is a separate empirical question, not a definitional reduction of the findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Rosetta Stone and Match-Up puzzles can be converted while preserving equivalent linguistic reasoning challenges; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Linguistic puzzles can be systematically converted between Rosetta Stone and Match-Up formats while preserving core reasoning demands
This assumption underpins the entire paired corpus and the validity of cross-format comparisons.

pith-pipeline@v0.9.0 · 5437 in / 1213 out tokens · 84694 ms · 2026-05-14T19:47:54.062872+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Introduction Linguistic puzzles designed for high school–level competitions, such as the International Linguistics Olympiad (IOL)1 and various national contests, are nowusednotonlytoassesstheskillsofhighschool students and other linguistics enthusiasts but also as benchmarks for evaluating the performance of LargeLanguageModels(LLMs)(Beanetal.,2024). Thus...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

They provide insight into what constitutes a well-designed linguistic puzzle

work page
[3]

They can inform and guide the development of future puzzle-generation procedures

work page
[4]

They can support the analysis of LLMs’ chain- of-thought reasoning, helping to better under- stand differences in decision-making between humans and LLMs. The rest of the paper is organized as follows: • Section 2 reviews prior work on linguistic puz- zle corpus creation and summarizes research on solving linguistic puzzles by both humans and large langua...

work page
[5]

experienced solvers are better prepared to handle these [Rosetta Stone puzzles] than problems of other types

Related Work 2.1. Existing Corpora The International Linguistics Olympiad (IOL), along with many national linguistic competitions, re- leases past puzzles together with their official solu- tions. Examples include NACLO (North America),6 OzCLO(Australia),7 UKLO (United Kingdom),8 etc. Theproblemspublishedbynationalcompetitions are incorporated into severa...

work page 2024
[6]

We focus on these two formats because they are the most frequently used in linguistic competitions

Corpus In this project, we investigate whether Rosetta Stone and Match-Up puzzles represent genuinely distinct formats or simply reflect different perspec- tives on the same underlying puzzle structure. We focus on these two formats because they are the most frequently used in linguistic competitions. Ac- cording to data released by UKLO, 45% of all com- ...

work page 2023
[7]

Solving Rosetta Stone and Match-Up Puzzle Pairs Using the generated corpus (Section 3), we evalu- ate how converting Rosetta Stone puzzles into the Match-Up format affects human and LLM perfor- mance on puzzle-solving tasks. 4.1. Human Evaluation Experiment For the human evaluation, we engaged two ac- complished Linguistic Olympiad participants with NACLO...

work page 2024
[8]

all-or-nothing

Approaches Towards Solving Linguistic Puzzles After completing the experiment described in Sec- tion 4.1, the human evaluators were interviewed about their experience with puzzles of different for- mats and the strategies they used to solve them. At this stage, the evaluators were not aware that the Match-Up puzzles had been derived from the corresponding...

work page 2013
[9]

Conclusion We test the hypothesis that two linguistic puzzle formats, Rosetta Stone and Match-Up, represent complementaryviewsofthesameunderlyingstruc- ture. To investigate this, we develop a systematic conversion procedure and apply it to create a cor- pus of 96 paired puzzles, each consisting of an original Rosetta Stone puzzle and its Match-Up counterp...

work page
[10]

Written informed consent (and parental/guardian consent for minors) was ob- tained; no personal data were collected or stored

Ethics Statement This study involved two experienced high-school lingusitic puzzle solvers. Written informed consent (and parental/guardian consent for minors) was ob- tained; no personal data were collected or stored. Tasks used publicly available competition puzzles. No sensitive attributes were elicited. The released dataset contains only problem state...

work page
[11]

Limitations Humanstudy(N=2)limitsinference;language/topic coverage is imbalanced; some Rosetta puzzles (e.g., multi-template verb systems) cannot be con- verted without additional heuristics; LLM scores may vary across versions and decoding settings; and our strict evaluation for Rosetta (exact-match only) may undercount partial progress

work page
[12]

Data and Code Availability The data is realized in: https://github.com/ef2020/lrec2026-data

work page
[13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Bibliographical References Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, and Hannah Rose Kirk. 2024.Lin- gOly: A Benchmark of Olympiad-Level Linguis- tic Reasoning Puzzles in Low-Resource and Ex- tinct Languages. InProceedings of the Thirty- Eighth Annual Conference on Neural Information Processing Syst...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Introduction Linguistic puzzles designed for high school–level competitions, such as the International Linguistics Olympiad (IOL)1 and various national contests, are nowusednotonlytoassesstheskillsofhighschool students and other linguistics enthusiasts but also as benchmarks for evaluating the performance of LargeLanguageModels(LLMs)(Beanetal.,2024). Thus...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

They provide insight into what constitutes a well-designed linguistic puzzle

work page

[3] [3]

They can inform and guide the development of future puzzle-generation procedures

work page

[4] [4]

They can support the analysis of LLMs’ chain- of-thought reasoning, helping to better under- stand differences in decision-making between humans and LLMs. The rest of the paper is organized as follows: • Section 2 reviews prior work on linguistic puz- zle corpus creation and summarizes research on solving linguistic puzzles by both humans and large langua...

work page

[5] [5]

experienced solvers are better prepared to handle these [Rosetta Stone puzzles] than problems of other types

Related Work 2.1. Existing Corpora The International Linguistics Olympiad (IOL), along with many national linguistic competitions, re- leases past puzzles together with their official solu- tions. Examples include NACLO (North America),6 OzCLO(Australia),7 UKLO (United Kingdom),8 etc. Theproblemspublishedbynationalcompetitions are incorporated into severa...

work page 2024

[6] [6]

We focus on these two formats because they are the most frequently used in linguistic competitions

Corpus In this project, we investigate whether Rosetta Stone and Match-Up puzzles represent genuinely distinct formats or simply reflect different perspec- tives on the same underlying puzzle structure. We focus on these two formats because they are the most frequently used in linguistic competitions. Ac- cording to data released by UKLO, 45% of all com- ...

work page 2023

[7] [7]

Solving Rosetta Stone and Match-Up Puzzle Pairs Using the generated corpus (Section 3), we evalu- ate how converting Rosetta Stone puzzles into the Match-Up format affects human and LLM perfor- mance on puzzle-solving tasks. 4.1. Human Evaluation Experiment For the human evaluation, we engaged two ac- complished Linguistic Olympiad participants with NACLO...

work page 2024

[8] [8]

all-or-nothing

Approaches Towards Solving Linguistic Puzzles After completing the experiment described in Sec- tion 4.1, the human evaluators were interviewed about their experience with puzzles of different for- mats and the strategies they used to solve them. At this stage, the evaluators were not aware that the Match-Up puzzles had been derived from the corresponding...

work page 2013

[9] [9]

Conclusion We test the hypothesis that two linguistic puzzle formats, Rosetta Stone and Match-Up, represent complementaryviewsofthesameunderlyingstruc- ture. To investigate this, we develop a systematic conversion procedure and apply it to create a cor- pus of 96 paired puzzles, each consisting of an original Rosetta Stone puzzle and its Match-Up counterp...

work page

[10] [10]

Written informed consent (and parental/guardian consent for minors) was ob- tained; no personal data were collected or stored

Ethics Statement This study involved two experienced high-school lingusitic puzzle solvers. Written informed consent (and parental/guardian consent for minors) was ob- tained; no personal data were collected or stored. Tasks used publicly available competition puzzles. No sensitive attributes were elicited. The released dataset contains only problem state...

work page

[11] [11]

Limitations Humanstudy(N=2)limitsinference;language/topic coverage is imbalanced; some Rosetta puzzles (e.g., multi-template verb systems) cannot be con- verted without additional heuristics; LLM scores may vary across versions and decoding settings; and our strict evaluation for Rosetta (exact-match only) may undercount partial progress

work page

[12] [12]

Data and Code Availability The data is realized in: https://github.com/ef2020/lrec2026-data

work page

[13] [13]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Bibliographical References Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, and Hannah Rose Kirk. 2024.Lin- gOly: A Benchmark of Olympiad-Level Linguis- tic Reasoning Puzzles in Low-Resource and Ex- tinct Languages. InProceedings of the Thirty- Eighth Annual Conference on Neural Information Processing Syst...

work page internal anchor Pith review Pith/arXiv arXiv 2024