arxiv: 2606.27103 · v1 · pith:SX2HRV7Inew · submitted 2026-06-25 · 💻 cs.CL

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

Pith reviewed 2026-06-26 04:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords riddle riddle paradigm · flexible reasoning · large language models · pattern matching · memory retrieval · reasoning strategies · literal versus inventive interpretation · human-AI comparison

The pith

LLMs solve genuine riddles at 85 percent accuracy but drop to 51 percent on versions rewritten to require only literal answers, while humans show the opposite pattern.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs achieve high accuracy on riddles through flexible strategy selection or through retrieving memorized patterns from training data. It introduces riddle riddles, which copy the surface form of popular riddles but change the content so that a correct answer needs a straightforward literal reading rather than an inventive twist. Across nine LLMs and 100 humans, models were far more accurate on the original riddles than on the altered versions, whereas people performed better on the altered versions. Error patterns show that most LLM mistakes on the literal versions came from applying inventive reasoning anyway, while human mistakes on real riddles more often came from sticking too literally to the words. The results indicate that strong LLM performance on riddles can be explained by surface-form matching rather than adaptive reasoning.

Core claim

The riddle riddle paradigm reveals that LLMs apply inventive reasoning even when a literal strategy is sufficient, producing lower accuracy on riddle riddles (50.7 percent) than on genuine riddles (84.9 percent), while humans show the reverse pattern (80.5 percent on riddle riddles versus 50.5 percent on genuine riddles). Most LLM errors on riddle riddles (90.8 percent) come from applying inventive reasoning where it is not needed, whereas 57.6 percent of human errors on genuine riddles come from overextending literal reasoning. This contrast supports the claim that LLM success on genuine riddles reflects memory retrieval of familiar riddle structures rather than content-driven selection of reasoning strategies.
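The size of the reversal can be made explicit with a little arithmetic on the reported percentages. The sketch below is illustrative only: the variable and function names are ours, and the second calculation assumes that the error-analysis shares use the same denominators as the headline accuracies, which the summary does not confirm.

```python
# Accuracies (percent) as reported in the paper's abstract.
ACC = {
    ("llm", "genuine"): 84.9,
    ("llm", "riddle_riddle"): 50.7,
    ("human", "genuine"): 50.5,
    ("human", "riddle_riddle"): 80.5,
}
# Share of errors attributed to the wrong strategy, in each group's weaker condition.
STRATEGY_ERROR_SHARE = {"llm": 0.908, "human": 0.576}
WEAK_CONDITION = {"llm": "riddle_riddle", "human": "genuine"}


def gap(group):
    """Genuine-riddle accuracy minus riddle-riddle accuracy, in percentage points."""
    return ACC[(group, "genuine")] - ACC[(group, "riddle_riddle")]


def strategy_error_rate(group):
    """Approximate percent of ALL responses in the weaker condition that are
    wrong-strategy errors (assumes matching denominators; see the lead-in)."""
    error_rate = 100.0 - ACC[(group, WEAK_CONDITION[group])]
    return error_rate * STRATEGY_ERROR_SHARE[group]


llm_gap = round(gap("llm"), 1)                # 34.2 points in favour of genuine riddles
human_gap = round(gap("human"), 1)            # -30.0 points, i.e. the reverse direction
interaction = round(llm_gap - human_gap, 1)   # 64.2-point difference of differences

llm_strategy = round(strategy_error_rate("llm"), 1)      # about 44.8 percent of responses
human_strategy = round(strategy_error_rate("human"), 1)  # about 28.5 percent of responses

print(llm_gap, human_gap, interaction, llm_strategy, human_strategy)
```

On this reading, roughly 45 percent of all LLM responses to riddle riddles would be misapplied inventive answers, against roughly 29 percent of human responses to genuine riddles being over-literal, provided the denominators line up as assumed.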

What carries the argument

The riddle riddle paradigm, which generates word problems that preserve riddle-like phrasing and structure but alter the required solution to a literal interpretation instead of an inventive one.

If this is right

  • LLM outputs that appear to demonstrate reasoning on tasks with familiar surface forms may instead result from pattern retrieval.
  • Cognitive benchmarks for LLMs require controls that force a contrast between surface-form strategies and content-based strategy switching.
  • Human performance advantages on literal versions of riddle-like problems point to a difference in how the two systems select reasoning modes.
  • Error analysis separating inappropriate inventive responses from other mistakes provides a diagnostic for distinguishing retrieval from flexible reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar surface-form controls could be applied to other reasoning benchmarks that use story problems or lateral-thinking items.
  • If training corpora contain many riddle examples, targeted removal or rewriting of such examples might reduce the performance gap between genuine and altered versions.
  • The paradigm could be extended to test whether the same retrieval-versus-flexibility split appears in non-riddle tasks that have both figurative and literal solution paths.

Load-bearing premise

Riddle riddles match genuine riddles in every respect except the shift from inventive to literal reasoning, with no unintended differences in difficulty, wording familiarity, or other surface features.

What would settle it

LLMs would achieve statistically equivalent accuracy on genuine riddles and riddle riddles after the two sets are matched for length, vocabulary frequency, and participant ratings of familiarity and difficulty.
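A first pass at the matching check could compare simple surface statistics within each riddle pair. The sketch below is a minimal illustration, not the authors' procedure: the two example pairs are invented here and are not taken from the paper's stimulus set, and vocabulary frequency and human ratings are left out because they need external resources.

```python
import re
from statistics import mean


def surface_stats(text):
    """Word count and mean word length for one stimulus."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "n_words": len(words),
        "mean_word_len": mean(len(w) for w in words),
    }


def paired_differences(pairs, key):
    """Genuine-minus-altered difference on one statistic, per matched pair."""
    return [
        surface_stats(genuine)[key] - surface_stats(altered)[key]
        for genuine, altered in pairs
    ]


# HYPOTHETICAL pairs written for this example (genuine riddle, literal rewrite).
PAIRS = [
    ("What has keys but can't open locks?", "What has keys and can open locks?"),
    ("What has hands but cannot clap?", "What has hands and can clap?"),
]

length_diffs = paired_differences(PAIRS, "n_words")
word_len_diffs = paired_differences(PAIRS, "mean_word_len")

print("word-count differences:", length_diffs)
print("mean word-length differences:", [round(d, 2) for d in word_len_diffs])
```

Per-pair differences close to zero on every measured dimension would support the premise; a systematic offset in either direction would be a candidate confound.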

Figures

Figures reproduced from arXiv: 2606.27103 by Bella Fascendini, Kathryn McGregor, Max D. Gupta, Thomas L. Griffiths.

Figure 1: Example human and LLM responses to three matched riddle pairs from our stimulus set.
Figure 2: LLM performance by condition (permissive coding). Each line represents one model.
Figure 3: Humans vs. mean LLM performance by condition.
Figure 4: Diagnostic error rates for humans and LLMs. For each solver, the bar shows the proportion …
Figure 5: Per-model reasoning correctness by condition (permissive coding). Lines show the …
Figure 6: LLM accuracy by condition (strict coding). Each line shows one model’s estimated accuracy …
Figure 7: Per-model reasoning correctness by condition (strict coding). Lines show the proportion of …

Original abstract

Humans flexibly adapt their reasoning strategies to the requirements of a given problem. Large language models (LLMs) have performed well on many cognitive tasks, however, it is unclear whether this accuracy is a result of pattern matching from training data or flexible reasoning. Here, we introduce a novel paradigm to test this question: the riddle riddle paradigm. Riddle riddles are word problems written to mimic popular riddles, but altered so their answers only require literal interpretations. Identifying correct answers requires looking past the structure of each question and flexibly apply different reasoning strategies based on the content. If LLMs respond to surface features, such as form, a riddle-like structure should cause models to use an inventive reasoning strategy even when a literal interpretation suffices. Alternatively, if LLMs reason based on content, they should flexibly switch strategies when appropriate. Across two experiments with nine state-of-the-art LLMs and 100 human participants, we show humans and LLMs fail on this paradigm in opposite directions. LLMs were far more accurate on genuine riddles than on riddle riddles (84.9% vs. 50.7%); whereas humans showed the reverse effect (50.5% vs. 80.5%). Error analysis shows that 90.8% of LLM errors on riddle riddles (the condition where they show diminished performance) were due to inappropriate use of inventive reasoning while only 57.6% of human errors on genuine riddles were due to overextending literal reasoning. Thus, while both groups make mistakes, reasoning mistakes are made more often by LLMs than by humans. Overall, LLMs' strong performance on genuine riddles may reflect memory retrieval rather than flexible strategy selection, and without stimuli designed to elicit this contrast, it becomes easy to conflate LLM-generated outputs that look like reasoning with genuine reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the 'riddle riddle' paradigm: word problems that mimic the surface form of popular riddles but require only literal interpretations for correct answers. Across nine LLMs and 100 humans, LLMs achieve 84.9% on genuine riddles but only 50.7% on riddle riddles, while humans show the reverse (50.5% vs. 80.5%). Error analysis attributes 90.8% of LLM riddle-riddle errors to inappropriate inventive reasoning (vs. 57.6% of human genuine-riddle errors from over-literal responses), supporting the claim that LLM riddle performance reflects memory retrieval rather than flexible strategy selection.

Significance. If the riddle-riddle stimuli are shown to be matched to genuine riddles on difficulty, familiarity, and surface features, the opposite performance patterns would provide direct evidence that current LLMs lack the flexible strategy switching humans exhibit, with implications for interpreting 'reasoning' benchmarks. The work is a clean empirical contrast with no fitted parameters or circular derivations.

major comments (3)
  1. [Methods / Stimulus construction] Methods (stimulus construction): no quantitative matching is reported between genuine riddles and riddle riddles on dimensions such as word frequency, perplexity under a language model, sentence length, or pre-tested human difficulty ratings. The central claim that the 34-point LLM accuracy drop reflects inappropriate inventive reasoning rather than general task difficulty therefore rests on an untested equivalence assumption.
  2. [Results / Error analysis] Results (error analysis): the classification that 90.8% of LLM riddle-riddle errors are 'inappropriate inventive reasoning' inherits the same confound; without independent validation that riddle riddles do not differ in surface features that might elicit inventive responses, the error percentages cannot be unambiguously attributed to strategy selection.
  3. [Abstract / Results] Abstract and Results: prompt wording and exact model instructions are not provided, leaving open the possibility that differences in how the two conditions were framed (rather than the literal vs. inventive requirement) contribute to the observed reversal.
minor comments (2)
  1. [Results] The paper should report statistical tests (e.g., mixed-effects models) for the condition-by-group interaction rather than relying solely on raw accuracy percentages.
  2. [Methods] Clarify the exact number and selection criteria for the nine LLMs and the 100 human participants (e.g., recruitment platform, exclusion criteria).
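As a rough illustration of the interaction test requested in the first minor comment, the sketch below runs a normal-approximation z-test on the difference of differences. It is a simplification under stated assumptions: the function name and the per-cell response counts are hypothetical (the summary does not report them), and the test treats every response as independent. Because items, models, and participants contribute repeated measures, this overstates certainty, which is why a mixed-effects model remains the appropriate analysis.

```python
from math import erfc, sqrt


def did_z_test(cells):
    """Normal-approximation test of the condition-by-group interaction.

    cells maps (group, condition) -> (accuracy as a proportion, number of responses).
    Returns (difference of differences, standard error, z, two-sided p-value).
    """
    def acc(group, condition):
        return cells[(group, condition)][0]

    did = (acc("llm", "genuine") - acc("llm", "riddle_riddle")) - (
        acc("human", "genuine") - acc("human", "riddle_riddle")
    )
    # Variance of a sum or difference of independent proportions adds across cells.
    se = sqrt(sum(p * (1.0 - p) / n for p, n in cells.values()))
    z = did / se
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided tail of the standard normal
    return did, se, z, p_value


# Accuracies are the reported ones; the response counts are HYPOTHETICAL placeholders.
CELLS = {
    ("llm", "genuine"): (0.849, 270),
    ("llm", "riddle_riddle"): (0.507, 270),
    ("human", "genuine"): (0.505, 1000),
    ("human", "riddle_riddle"): (0.805, 1000),
}

did, se, z, p_value = did_z_test(CELLS)
print(f"difference of differences = {did:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

Swapping in the real per-cell counts is the only change needed to reuse the function, but the independence caveat applies regardless of the counts.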

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Methods / Stimulus construction] Methods (stimulus construction): no quantitative matching is reported between genuine riddles and riddle riddles on dimensions such as word frequency, perplexity under a language model, sentence length, or pre-tested human difficulty ratings. The central claim that the 34-point LLM accuracy drop reflects inappropriate inventive reasoning rather than general task difficulty therefore rests on an untested equivalence assumption.

    Authors: We agree that reporting quantitative matching on these dimensions would provide stronger evidence for the equivalence of the stimuli. In the revised manuscript, we will include analyses comparing word frequency, sentence length, and perplexity between the genuine riddles and riddle riddles. Additionally, we will report any available pre-tested difficulty ratings or conduct a small validation if feasible. This will allow us to statistically test for differences and address the equivalence assumption directly. revision: yes

  2. Referee: [Results / Error analysis] Results (error analysis): the classification that 90.8% of LLM riddle-riddle errors are 'inappropriate inventive reasoning' inherits the same confound; without independent validation that riddle riddles do not differ in surface features that might elicit inventive responses, the error percentages cannot be unambiguously attributed to strategy selection.

    Authors: The error analysis is based on post-hoc classification of responses, and we acknowledge the potential for surface features to influence the type of errors. To address this, we will add a section in the revision where we provide independent ratings of the 'riddle-likeness' or surface similarity of the stimuli, and correlate these with error types if possible. However, the core design ensures that the correct response for riddle riddles is literal by construction, as the questions were altered specifically for that purpose. revision: partial

  3. Referee: [Abstract / Results] Abstract and Results: prompt wording and exact model instructions are not provided, leaving open the possibility that differences in how the two conditions were framed (rather than the literal vs. inventive requirement) contribute to the observed reversal.

    Authors: We recognize the importance of providing the exact prompts for reproducibility. In the revised version, we will include the full prompt templates used for both genuine riddles and riddle riddles across all models in the Methods section or as supplementary material. This will clarify that the instructions were consistent in framing the task as answering the question, without biasing towards inventive or literal responses. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison with measured outcomes

The paper reports an experimental comparison of LLM and human accuracy on genuine riddles versus constructed riddle riddles, along with error coding. No equations, derivations, fitted parameters, or predictions appear anywhere in the abstract or described results. Performance metrics (84.9% vs 50.7% for LLMs; 50.5% vs 80.5% for humans) and error percentages are direct observational outcomes rather than quantities defined in terms of themselves or reduced to prior self-citations. The central claim rests on the observed performance gap and error patterns, which are independent of any self-referential construction. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no free parameters or invented entities. It rests on one domain assumption about stimulus construction.

axioms (1)
  • domain assumption Riddle riddles can be written so that they match genuine riddles in surface structure while requiring only literal interpretation, without introducing other confounds
    This assumption is required to attribute accuracy differences to reasoning strategy selection rather than stimulus artifacts.

pith-pipeline@v0.9.1-grok · 5872 in / 1354 out tokens · 38803 ms · 2026-06-26T04:42:49.754506+00:00 · methodology
