pith. sign in

arxiv: 2506.05205 · v2 · submitted 2025-06-05 · 💻 cs.CL

RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context

Pith reviewed 2026-05-19 10:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language modelscomplex reasoningcontext-free languagesin-context learningreasoning tokensbenchmarksevaluation frameworks
0
0 comments X

The pith

Advanced reasoning models fail to scale their effort and instead reduce reasoning tokens as context-free language recognition tasks grow harder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create RELIC to test models on deciding whether a string is part of a language defined by a grammar given in the prompt. They adjust difficulty by increasing the number of grammar rules or the length of the string to see how models respond. The key finding is that models do not spend more time thinking on harder cases but use fewer reasoning steps and shift toward guessing. If this holds, it means current approaches to reasoning in LLMs may break down when many pieces of information must be combined over long chains.

Core claim

Even the most advanced reasoning models perform poorly on RELIC, failing to scale inference compute with task difficulty and instead reducing the number of reasoning tokens as complexity increases, which accompanies a change from algorithmic solutions to guessing strategies.

What carries the argument

The RELIC evaluation framework that uses recognition of context-free languages presented in-context, with complexity modulated by grammar size and string length.

If this is right

  • Models may not be reliable for tasks that require increasing amounts of computation as problems become harder.
  • Observed decreases in reasoning tokens indicate a strategy change toward guessing on difficult instances.
  • The quiet quitting behavior suggests models may produce low-effort answers without explicit signals when tasks exceed a certain complexity.
  • RELIC provides a scalable way to measure how reasoning strategies degrade with complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar reductions in effort might occur in other formal reasoning domains beyond context-free languages.
  • Developers could design training or prompting methods that explicitly encourage maintaining or increasing compute on harder sub-tasks.
  • The framework might be adapted to test reasoning in other areas like planning or verification where complexity can be controlled.

Load-bearing premise

That a reduction in the number of reasoning tokens reliably signals a move to guessing rather than an efficient or adaptive approach to the harder problem.

What would settle it

An experiment where models are shown to increase or maintain the count of reasoning tokens while improving accuracy on instances with larger grammars or longer strings would challenge the finding.

Figures

Figures reproduced from arXiv: 2506.05205 by Jackson Petty, Michael Y. Hu, Shauli Ravfogel, Tal Linzen, Wentao Wang, William Merrill.

Figure 1
Figure 1. Figure 1: RELIC is an evaluation framework for compositional instruction following, where we (a) stochastically generate context-free grammars (e.g., sets of instructions) of given complexities, (b) sample positive and negative strings (e.g., tasks) for each grammar, and (c) prompt models to classify whether the strings are generated by those grammars. 1 Introduction Large language models (LLMs) are increasingly exp… view at source ↗
Figure 2
Figure 2. Figure 2: Models’ accuracy on RELIC-500 reduces to near-chance (dashed lines) for all models as instruction set complexity (number of nonlexical productions in the grammar, top row) and task complexity (example length, bottom row) increase. 0% 100% Accuracy positive negative gpt-4.1-nano gpt-4.1-mini gpt-4.1 o4-mini o3 gemma-3-1b gemma-3-4b DSR1-7B 1 50 Task Complexity (Example Length) 0% 100% Predicted Type positiv… view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy on positive and negative examples differs between models and across sequence [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Spearman’s rank correlation coefficients for the mean accuracy per grammar ( [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Successfully accepting a valid string by [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: As task complexity (example length) increases, the test-time compute (TTC) expended by [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Correlations between generating hyperparameters for the released static set. Note that [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Proportions of example types represented in the static dataset, in aggregate ( [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of grammars by coverage (i.e., the size of the language they generate, measured [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Spearman’s rank correlation coefficients for the per-example accuracies between different [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Spearman’s rank correlation coefficients for the per-example accuracies between different [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Absolute test-time-compute expended by models as a function of example length. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: (top row) Among open-weights models, the Gemma 3 variants mostly generated comple￾tions well below the new token limit, while DSR1 routinely failed to finish completions within bounds. (bottom row) Within each grammar, models have a lower accuracy on finished completions than they do on incomplete generations. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks it is hard to accurately gauge how model performance and strategy change as task complexity grows. To evaluate models' complex reasoning capability in a scalable and verifiable way, we introduce RELIC (Recognition of Languages In-Context), a framework that evaluates an LLM's ability to decide whether a given string belongs to the context-free language (CFL) generated by a grammar presented in-context. CFL recognition allows us to modulate the intrinsic complexity of the problem by varying grammar size and string length and translate this asymptotic complexity into predictions for ideal LLM performance. We find that even the most advanced reasoning models perform poorly on RELIC, not only failing to appropriately scale their inference compute to keep pace with task difficulty, but even reducing the number of reasoning tokens they use as task complexity increases. We find that these decreases in compute accompany changes in reasoning strategy, as models move from identifying and implementing algorithmic solutions to guessing. For models whose full completions go uninspected, this manifests as ``quiet quitting'' on hard tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RELIC, a framework for evaluating LLMs' complex reasoning capabilities through in-context recognition of context-free languages (CFLs). Grammars are provided in-context, and task complexity is modulated by varying grammar size and string length, enabling asymptotic predictions for ideal model performance. The central empirical claim is that even advanced reasoning models perform poorly on RELIC: they fail to scale inference compute with increasing difficulty and instead reduce the number of reasoning tokens used, accompanied by a shift from algorithmic solutions to guessing (sometimes described as 'quiet quitting' on hard tasks).

Significance. If substantiated with appropriate controls, RELIC offers a scalable and verifiable benchmark grounded in formal language theory for probing how LLMs adapt reasoning effort to task complexity. The ability to generate clear performance predictions from CFL properties is a methodological strength, and the reported pattern of reduced token usage on harder instances challenges common assumptions about LLM scaling of compute. This could inform future work on reasoning strategies if the token-reduction finding is isolated from output artifacts.

major comments (2)
  1. [Abstract] Abstract: The interpretation that a decrease in reasoning tokens with increasing grammar size or string length indicates a shift from algorithmic CFL recognition to guessing is not isolated from confounds. Without controls such as inspecting whether partial derivations remain in the trace, normalizing token counts by total generation length, or checking for early termination on long inputs, the strategy-shift claim remains vulnerable to alternative explanations like truncation or format changes.
  2. [Abstract] The manuscript reports clear empirical findings on performance and token usage but the abstract provides no quantitative results, error bars, dataset sizes, or verification that token reduction is not an artifact of prompting/decoding settings. Full results with these details are needed to assess the magnitude and reliability of the central claims about compute scaling and strategy changes.
minor comments (2)
  1. Clarify the exact definition and measurement of 'reasoning tokens' versus total output tokens, and specify whether token counts are taken from the full completion or only the final answer portion.
  2. Add explicit discussion of how the CFL recognition task translates asymptotic complexity into concrete predictions for LLM behavior, including any assumptions about perfect in-context learning of the grammar.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each major comment below, clarifying our analyses and indicating where revisions have been made to strengthen the presentation of results and controls.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The interpretation that a decrease in reasoning tokens with increasing grammar size or string length indicates a shift from algorithmic CFL recognition to guessing is not isolated from confounds. Without controls such as inspecting whether partial derivations remain in the trace, normalizing token counts by total generation length, or checking for early termination on long inputs, the strategy-shift claim remains vulnerable to alternative explanations like truncation or format changes.

    Authors: We thank the referee for identifying potential confounds in the token-reduction interpretation. Our original trace analysis already showed that harder instances frequently lack complete algorithmic steps such as full derivations or recursive checks, consistent with a shift toward direct guessing. In the revision we have added explicit controls: (1) manual inspection of a stratified sample of traces confirming incomplete partial derivations on high-complexity items, (2) normalization of reasoning-token counts by total generated length (the reduction remains statistically significant), and (3) verification that generation settings use fixed maximum lengths with no differential early stopping across difficulty levels. These additions reduce the plausibility of truncation or format-change explanations while preserving the core observation. revision: yes

  2. Referee: [Abstract] The manuscript reports clear empirical findings on performance and token usage but the abstract provides no quantitative results, error bars, dataset sizes, or verification that token reduction is not an artifact of prompting/decoding settings. Full results with these details are needed to assess the magnitude and reliability of the central claims about compute scaling and strategy changes.

    Authors: We agree that the abstract would benefit from quantitative anchors. The revised abstract now reports key metrics: mean accuracy declines from 82% (small grammars) to 31% (large grammars), with standard errors computed over five independent runs; each complexity bin contains 400–600 test strings; and the token-reduction pattern is reproduced under both greedy decoding and temperature-0.7 sampling with varied prompt phrasings. These details are also expanded in the main results section with accompanying tables. revision: yes

Circularity Check

0 steps flagged

No circularity: RELIC derives performance expectations from standard CFL theory without self-referential reduction

full rationale

The paper's core mapping from grammar size and string length to ideal LLM performance expectations follows directly from standard asymptotic complexity results in formal language theory for context-free languages, which are externally defined and not fitted or redefined within the paper. Empirical measurements of token counts, accuracy, and strategy shifts on external model APIs are observational and falsifiable independently of any author-defined parameters. No equations, predictions, or central claims reduce by construction to inputs via self-definition, fitted subsets, or load-bearing self-citations; the framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the standard definition of context-free languages and the assumption that token count is a reasonable proxy for inference compute; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Context-free language recognition is a suitable proxy for complex multi-step reasoning in LLMs
    The paper selects CFL recognition because its asymptotic complexity can be modulated and ideal performance can be predicted.

pith-pipeline@v0.9.0 · 5751 in / 1213 out tokens · 44968 ms · 2026-05-19T10:51:11.345643+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diagnosing CFG Interpretation in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    doi: 10.1109/t-c.1973.223746

    ISSN 0018-9340. doi: 10.1109/t-c.1973.223746. URL https://ieeexplore.ieee.org/ document/1672339. A. Butoi, G. Khalighinejad, A. Svete, J. Valvoda, R. Cotterell, and B. DuSell. Training neural networks as recognizers of formal languages, 11 Nov. 2024. URLhttp://arxiv.org/abs/2411.07107. N. Chomsky and M. P. Schützenberger. The Algebraic Theory of Context-F...

  2. [2]

    URL https://proceedings.mlr.press/v57/clark16.pdf

    PMLR. URL https://proceedings.mlr.press/v57/clark16.pdf. 10 A. Clark. syntheticpcfg: Code for generating synthetic PCFGs for testing grammatical inference algorithms, 2018. URL https://github.com/alexc17/syntheticpcfg. DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou...

  3. [3]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,

    Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.731. URL https://aclanthology.org/2020.emnlp-main.731/. L. Lee. Fast context-free grammar parsing requires fast boolean matrix multiplication.Journal of the ACM, 49(1):1–15, 1 Jan. 2002. ISSN 0004-5411,1557-735X. doi: 10.1145/505241.505242. URL https://dl.acm.org/doi/10.1145/50524...

  4. [4]

    URL https://openreview.net/forum?id=bNt7oajl2a. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research: JMLR , 21(2020):1–67, 23 Oct. 2019. ISSN 1532-4435,1533-7928. URL http://arxiv.org/abs/1910....

  5. [5]

    URL https://openreview.net/forum?id=gEZrGCozdqR. T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun- towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. Transformers: State-of-the-art natural lan- guage processing. In Pr...

  6. [6]

    Transformers: State-of-the-Art Natural Language Processing

    Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL http://dx.doi.org/10.18653/v1/2020.emnlp-demos.6. X. Zhang, Y . Chen, S. Hu, Z. Xu, J. Chen, M. K. Hao, X. Han, Z. L. Thai, S. Wang, Z. Liu, and M. Sun. ∞Bench: Extending long context evaluation beyond 100K tokens, 21 Feb. 2024. URL http://arxiv.org/abs/2402.13718. L. Zhen...

  7. [7]

    t44 → Not t27. 2. t249 → Not t27. 3. t80 → Not t27. 4. t333 → Not t27. 5. t171 → Not t27. 6. t194 → Not t27. 7. t329 → Not t27. 8. t224 → Not t27. 9. t196 → Not t27. 10. t64 → Not t27. 11. t8 → Not t27. 12. t43 → Not t27. 13. t16 → Not t27. 14. t15 → Not t27. 15. t280 → Not t27. 16. t220 → Not t27. 17. t107 → Not t27. 18. t230 → Not t27. 19. t334 → Not t2...

  8. [8]

    t19 → Not t27. 22. t296 → Not t27. After reviewing all the tokens, I don’t see ’t27’ anywhere in the string. Each token is unique and none of them is ’t27’. So, based on my evaluation, the output of the evaluation doesn’t contain ’t27’. Therefore, the condition isn’t met, and the answer should be ’No’. 22 <think> No F.4 gpt-4.1-mini, Grammar 2025041516250...

  9. [9]

    **NT2** - NT2 -> NT3 NT6 - NT2 -> ’t9’ - NT2 -> ’t23’ - NT2 -> ’t27’ - NT2 -> ’t4’ - NT2 -> ’t30’

  10. [10]

    **NT3** (various terminals) - ’t27’, ’t29’, ’t12’, ’t1’, ’t28’, ’t20’, ’t23’, ’t24’ (In particular, ’t24’ is here.)

  11. [11]

    **NT4** - NT4 -> NT2 NT5 - NT4 -> many terminals: ’t18’, ’t34’, ’t23’, ’t5’, ’t29’, ’t11’, ’t25’, ’t4’, ’t7’

  12. [12]

    **NT5** - NT5 -> NT0 NT5 - NT5 -> many terminals: ’t18’, ’t23’, ’t25’, ’t13’ Also: - NT0 has many terminals like ’t24’, ’t27’, ’t13’, etc., including ’t24’

  13. [13]

    if the string is long, it is likely generated by the grammar

    **NT6** - NT6 -> NT4 NT6 - NT6 -> many terminals: ’t8’, ’t25’, ’t29’, ’t22’, ’t12’, ’t9’, ’t18’, ’t10’ –- 23 ### String to parse: ‘t30 t24 t24 t23 t4‘ –- ### Step 1: Since S -> NT5 NT2 The string breaks into two substrings: - substring generated by NT5 - substring generated by NT2 Let’s try possible splits: - split after 1 token: (t30) (t24 t24 t23 t4) - ...