
arxiv: 2605.13408 · v1 · pith:EOXZTLKFnew · submitted 2026-05-13 · 💻 cs.CL

From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks

Pith reviewed 2026-05-14 19:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords linguistic puzzles · Rosetta Stone · Match-Up · paired corpus · LLM benchmarks · human solvers · puzzle conversion · linguistic reasoning

The pith

A conversion procedure turns Rosetta Stone puzzles into Match-Up versions that both expert humans and LLMs either solve completely or fail entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a systematic way to convert existing Rosetta Stone linguistic puzzles into corresponding Match-Up puzzles to speed up creation of new examples. This paired dataset lets researchers compare how the two formats test the same underlying linguistic reasoning. Tests with expert human solvers and large language models on the converted puzzles show an all-or-nothing pattern where participants succeed fully or not at all. The work supplies a new resource for studying puzzle difficulty and offers benchmarks for both human and machine performance on these tasks.

Core claim

The authors develop a systematic conversion procedure that produces Match-Up counterparts from Rosetta Stone puzzles. When the resulting pairs are solved by expert humans and LLMs, both groups display an all-or-nothing pattern on the Match-Up versions, completing them entirely or failing completely.

What carries the argument

The systematic conversion procedure that transforms Rosetta Stone puzzles into Match-Up counterparts while aiming to preserve linguistic reasoning demands.

Load-bearing premise

The conversion procedure produces Match-Up puzzles that preserve the original linguistic reasoning demands and difficulty level of the Rosetta Stone versions.

What would settle it

A converted Match-Up puzzle on which expert solvers produce partial but incomplete correct answers would challenge the all-or-nothing pattern.
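The falsification test above depends on sorting each attempt into one of three outcomes. A minimal sketch of that sorting follows, assuming a Match-Up answer can be represented as a mapping from item labels to chosen targets. The function name `classify_outcome` and the four-item puzzle are hypothetical and are not taken from the paper.

```python
def classify_outcome(gold, response):
    """Label one Match-Up attempt as 'all', 'none', or 'partial'.

    gold and response map item labels to chosen targets (an assumed
    representation; the paper's own answer format is not shown here).
    """
    if not gold:
        raise ValueError("gold mapping must contain at least one item")
    correct = sum(1 for item, target in gold.items()
                  if response.get(item) == target)
    if correct == len(gold):
        return "all"
    if correct == 0:
        return "none"
    return "partial"


# Hypothetical four-item puzzle, for illustration only.
gold = {"a": 1, "b": 2, "c": 3, "d": 4}
solved = classify_outcome(gold, {"a": 1, "b": 2, "c": 3, "d": 4})
failed = classify_outcome(gold, {"a": 2, "b": 1, "c": 4, "d": 3})
mixed = classify_outcome(gold, {"a": 1, "b": 2, "c": 4, "d": 3})
print(solved, failed, mixed)
```

Under this framing, any expert attempt labeled `partial` on a converted puzzle would count as evidence against the all-or-nothing pattern.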

Original abstract

In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a systematic procedure to convert existing Rosetta Stone linguistic puzzles into paired Match-Up counterparts, generating a new dataset of such pairs. It then evaluates the pairs using expert human solvers and LLMs, reporting that both exhibit an all-or-nothing performance pattern on Match-Up puzzles (complete success or total failure). The work positions the conversion method as an efficient way to scale puzzle creation and uses the paired data to compare difficulty and reasoning demands across formats.

Significance. If the conversion is shown to preserve original reasoning demands, the paired corpus would be a useful resource for controlled comparisons of linguistic reasoning in humans versus LLMs. The reported all-or-nothing pattern, if robustly demonstrated, could highlight format-specific challenges and inform the construction of more diagnostic benchmarks for machine linguistic capabilities.

major comments (2)
  1. [§3] §3 (Conversion Procedure): The central claim that the systematic conversion produces Match-Up puzzles with equivalent linguistic reasoning demands to the Rosetta Stone originals is not supported by any validation step, such as item-level matching of clue structures, ambiguity types, or pre/post-conversion solver error distributions. Without this, the observed all-or-nothing pattern on Match-Up puzzles could be an artifact of the conversion rather than an inherent format difference.
  2. [Results] Results section (and abstract): The all-or-nothing pattern is asserted for both humans and LLMs without reporting sample sizes (number of participants, puzzles per condition, or LLM instances), statistical tests for binarity, or error analysis. This absence makes it impossible to evaluate robustness against selection bias or small-sample effects, directly affecting the load-bearing empirical claim.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the number of puzzle pairs generated and the exact human/LLM sample sizes to allow readers to gauge scale immediately.
  2. [Figures/Tables] Figure captions and table headers should explicitly label whether performance metrics are per-puzzle or aggregated, and whether 'all-or-nothing' is defined by exact match or partial credit.
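The distinction raised in the second minor comment can be made concrete. The sketch below scores the same hypothetical attempts under exact match and under partial credit, first per puzzle and then aggregated; the function names and the data are illustrative assumptions, not the paper's evaluation code.

```python
from statistics import mean


def exact_match(gold, response):
    """1.0 only if every item is matched correctly, else 0.0."""
    return 1.0 if all(response.get(k) == v for k, v in gold.items()) else 0.0


def partial_credit(gold, response):
    """Fraction of items matched correctly."""
    return sum(response.get(k) == v for k, v in gold.items()) / len(gold)


# Hypothetical attempts on one four-item puzzle: fully right, half right, all wrong.
gold = {"a": 1, "b": 2, "c": 3, "d": 4}
attempts = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 1, "b": 2, "c": 4, "d": 3},
    {"a": 2, "b": 1, "c": 4, "d": 3},
]

# Per-puzzle view: one (exact, partial) pair per attempt.
per_puzzle = [(exact_match(gold, r), partial_credit(gold, r)) for r in attempts]

# Aggregated view: a single mean per metric.
aggregated_exact = mean(e for e, _ in per_puzzle)
aggregated_partial = mean(p for _, p in per_puzzle)
print(per_puzzle, aggregated_exact, aggregated_partial)
```

The half-right attempt contributes nothing under exact match and 0.5 under partial credit, so the two aggregated means differ. An all-or-nothing pattern is only visible in the per-puzzle partial-credit scores, which is why the labeling matters.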

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback highlights important areas where additional detail will improve the clarity and robustness of our claims regarding the conversion procedure and the empirical results. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [§3] §3 (Conversion Procedure): The central claim that the systematic conversion produces Match-Up puzzles with equivalent linguistic reasoning demands to the Rosetta Stone originals is not supported by any validation step, such as item-level matching of clue structures, ambiguity types, or pre/post-conversion solver error distributions. Without this, the observed all-or-nothing pattern on Match-Up puzzles could be an artifact of the conversion rather than an inherent format difference.

    Authors: We agree that the manuscript would benefit from explicit validation steps to support the equivalence of reasoning demands. The conversion procedure was constructed as a systematic, rule-based mapping that directly translates each linguistic clue, ambiguity, and dependency from the Rosetta Stone format into the corresponding Match-Up structure. In the revised manuscript we will add a dedicated subsection to §3 that includes item-level examples demonstrating preserved clue structures and ambiguity types across multiple puzzle pairs. We will also report a comparison of human solver error distributions on the original and converted versions for the evaluated pairs. These additions will provide direct evidence that the all-or-nothing pattern arises from format differences rather than conversion artifacts. revision: yes

  2. Referee: [Results] Results section (and abstract): The all-or-nothing pattern is asserted for both humans and LLMs without reporting sample sizes (number of participants, puzzles per condition, or LLM instances), statistical tests for binarity, or error analysis. This absence makes it impossible to evaluate robustness against selection bias or small-sample effects, directly affecting the load-bearing empirical claim.

    Authors: We acknowledge that the current reporting lacks the necessary quantitative details for assessing robustness. The revised Results section and abstract will explicitly report the sample sizes used, including the number of expert human participants, the number of puzzles per condition, and the number of LLM instances (with model versions and run counts). We will add formal statistical tests for the binarity of performance distributions (e.g., a multimodality test) along with a categorized error analysis of failure modes for both humans and LLMs. These changes will allow readers to evaluate potential selection bias and small-sample effects directly. revision: yes
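The rebuttal does not commit to a particular binarity test. As one illustrative option, the sketch below computes a bimodality coefficient from population moments, (skewness² + 1) / kurtosis, over per-puzzle partial-credit scores. Values above 5/9, the value for a uniform distribution, are conventionally read as suggestive of bimodality. This is a descriptive heuristic without a small-sample correction, not a formal test, and the score lists are hypothetical.

```python
def bimodality_coefficient(scores):
    """Return (skewness**2 + 1) / kurtosis using population moments.

    Kurtosis here is the non-excess form (a normal distribution gives 3).
    """
    n = len(scores)
    mu = sum(scores) / n
    m2 = sum((x - mu) ** 2 for x in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; coefficient is undefined")
    m3 = sum((x - mu) ** 3 for x in scores) / n
    m4 = sum((x - mu) ** 4 for x in scores) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return (skewness ** 2 + 1) / kurtosis


# Hypothetical per-puzzle partial-credit scores.
all_or_nothing = [0.0, 0.0, 1.0, 1.0]
clustered = [0.4, 0.5, 0.5, 0.5, 0.6]

bc_binary = bimodality_coefficient(all_or_nothing)
bc_clustered = bimodality_coefficient(clustered)
print(bc_binary, bc_clustered)
```

Scores split evenly between 0 and 1 give the maximum value of 1, while scores clustered around the middle fall below 5/9. With very few solvers or puzzles the coefficient is unstable, so it would complement, rather than replace, reporting the raw per-puzzle scores.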

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and observed patterns

Full rationale

The paper proposes a conversion procedure from Rosetta Stone to Match-Up puzzles and reports empirical results from human solvers and LLMs on the resulting pairs. No equations, fitted parameters, or derivations are present that reduce the reported all-or-nothing pattern to inputs by construction. The central observations arise directly from the new paired dataset and external benchmarks rather than self-definition, self-citation chains, or renaming of prior results. The conversion step is an input-generation method whose validity is a separate empirical question, not a definitional reduction of the findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Rosetta Stone and Match-Up puzzles can be converted while preserving equivalent linguistic reasoning challenges; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Linguistic puzzles can be systematically converted between Rosetta Stone and Match-Up formats while preserving core reasoning demands
    This assumption underpins the entire paired corpus and the validity of cross-format comparisons.

pith-pipeline@v0.9.0 · 5437 in / 1213 out tokens · 84694 ms · 2026-05-14T19:47:54.062872+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1]

    Introduction Linguistic puzzles designed for high school–level competitions, such as the International Linguistics Olympiad (IOL) and various national contests, are now used not only to assess the skills of high school students and other linguistics enthusiasts but also as benchmarks for evaluating the performance of Large Language Models (LLMs) (Bean et al., 2024). Thus...

  2. [2]

    They provide insight into what constitutes a well-designed linguistic puzzle

  3. [3]

    They can inform and guide the development of future puzzle-generation procedures

  4. [4]

    They can support the analysis of LLMs’ chain-of-thought reasoning, helping to better understand differences in decision-making between humans and LLMs. The rest of the paper is organized as follows: • Section 2 reviews prior work on linguistic puzzle corpus creation and summarizes research on solving linguistic puzzles by both humans and large langua...

  5. [5]

    experienced solvers are better prepared to handle these [Rosetta Stone puzzles] than problems of other types

    Related Work 2.1. Existing Corpora The International Linguistics Olympiad (IOL), along with many national linguistic competitions, releases past puzzles together with their official solutions. Examples include NACLO (North America), OzCLO (Australia), UKLO (United Kingdom), etc. The problems published by national competitions are incorporated into severa...

  6. [6]

    We focus on these two formats because they are the most frequently used in linguistic competitions

    Corpus In this project, we investigate whether Rosetta Stone and Match-Up puzzles represent genuinely distinct formats or simply reflect different perspectives on the same underlying puzzle structure. We focus on these two formats because they are the most frequently used in linguistic competitions. According to data released by UKLO, 45% of all com- ...

  7. [7]

    Solving Rosetta Stone and Match-Up Puzzle Pairs Using the generated corpus (Section 3), we evaluate how converting Rosetta Stone puzzles into the Match-Up format affects human and LLM performance on puzzle-solving tasks. 4.1. Human Evaluation Experiment For the human evaluation, we engaged two accomplished Linguistic Olympiad participants with NACLO...

  8. [8]

    all-or-nothing

    Approaches Towards Solving Linguistic Puzzles After completing the experiment described in Section 4.1, the human evaluators were interviewed about their experience with puzzles of different formats and the strategies they used to solve them. At this stage, the evaluators were not aware that the Match-Up puzzles had been derived from the corresponding...

  9. [9]

    Conclusion We test the hypothesis that two linguistic puzzle formats, Rosetta Stone and Match-Up, represent complementary views of the same underlying structure. To investigate this, we develop a systematic conversion procedure and apply it to create a corpus of 96 paired puzzles, each consisting of an original Rosetta Stone puzzle and its Match-Up counterp...

  10. [10]

    Written informed consent (and parental/guardian consent for minors) was obtained; no personal data were collected or stored

    Ethics Statement This study involved two experienced high-school linguistic puzzle solvers. Written informed consent (and parental/guardian consent for minors) was obtained; no personal data were collected or stored. Tasks used publicly available competition puzzles. No sensitive attributes were elicited. The released dataset contains only problem state...

  11. [11]

    Limitations Human study (N=2) limits inference; language/topic coverage is imbalanced; some Rosetta puzzles (e.g., multi-template verb systems) cannot be converted without additional heuristics; LLM scores may vary across versions and decoding settings; and our strict evaluation for Rosetta (exact-match only) may undercount partial progress

  12. [12]

    Data and Code Availability The data is realized in: https://github.com/ef2020/lrec2026-data

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Bibliographical References Andrew M. Bean, Simi Hellsten, Harry Mayne, Jabez Magomere, Ethan A. Chi, Ryan Chi, Scott A. Hale, and Hannah Rose Kirk. 2024. LingOly: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Syst...