pith · machine review for the scientific record

arXiv:2604.13371 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:08 UTC · model grok-4.3

classification: 💻 cs.CL
keywords: reasoning collapse · large language models · complexity thresholds · reasoning tasks · phase transition · benchmarking · validity constraints · discrete problems

The pith

Large language models undergo reasoning collapse beyond task-specific complexity thresholds in discrete reasoning problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests large reasoning models on nine classical tasks, including Sudoku and the Tower of Hanoi, with complexity increased gradually through parameterization. Using strict validators that accept only complete, valid solutions, the evaluation shows strong performance on easy instances but sharp declines past task-specific points. The degradation appears consistently across models and tasks, with failure modes such as constraint violations and loss of state tracking. These patterns indicate that aggregate accuracy scores hide fundamental robustness issues as problem difficulty grows. The authors argue this calls for evaluation approaches focused on complexity scaling rather than fixed benchmarks.

Core claim

The authors establish that LLMs display consistent phase-transition-like behavior on these tasks: high accuracy in low-complexity regimes gives way to sharp degradation beyond task-specific thresholds, a phenomenon they term reasoning collapse, marked by accuracy drops often exceeding 50 percent along with invalid and inconsistent outputs.

What carries the argument

A suite of nine parameterized classical reasoning tasks equipped with deterministic validators that enforce explicit validity constraints on solutions.
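
To make "deterministic validator" concrete, here is a minimal sketch for one task, Tower of Hanoi, written against the rules in the paper's Figure 1 caption (only the top disk moves; no larger disk on a smaller one). It illustrates the all-or-nothing acceptance the review describes; it is not the authors' code, and the move-list format is an assumption.

    def validate_hanoi(n: int, moves: list[tuple[str, str]]) -> bool:
        """Accept a move list only if every move is legal and the final
        state has all n disks on peg C. Disks are numbered 1 (smallest) to n."""
        pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom -> top
        for src, dst in moves:
            if src not in pegs or dst not in pegs or not pegs[src]:
                return False                       # malformed or empty-source move
            disk = pegs[src][-1]                   # only the top disk may move
            if pegs[dst] and pegs[dst][-1] < disk:
                return False                       # larger disk onto smaller: invalid
            pegs[dst].append(pegs[src].pop())
        return pegs["C"] == list(range(n, 0, -1))  # only complete solutions pass

    # A partial or "almost right" transcript scores zero: validation is all-or-nothing.
    assert validate_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")])
    assert not validate_hanoi(2, [("A", "C"), ("A", "C")])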

If this is right

  • Accuracy declines often exceed 50% past the thresholds (see the compounding sketch after this list).
  • Longer reasoning traces do not reliably increase correctness.
  • Gains on one problem family fail to generalize to others.
  • Current models lose state tracking and produce confidently incorrect outputs at high complexity.
  • Static benchmarks are insufficient for measuring reasoning robustness.
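
The paper's Tower of Hanoi discussion gives the first two bullets a quantitative backbone: if each step is correct with probability p, success on a task whose optimal solution has L_min(n) steps scales as P_success ≈ p^L_min(n), and for Hanoi L_min(n) = 2^n − 1, so each added disk doubles the trajectory the model must keep error-free. A minimal sketch of that decay (the p = 0.99 figure is illustrative, not measured):

    def p_success(p: float, n: int) -> float:
        """Compounding-error model: success requires every one of the
        L_min(n) = 2**n - 1 optimal Tower of Hanoi moves to be correct."""
        return p ** (2 ** n - 1)

    for n in (3, 6, 9, 12):
        print(n, p_success(0.99, n))
    # n=3  -> ~0.93   (7 steps)
    # n=6  -> ~0.53   (63 steps)
    # n=9  -> ~0.006  (511 steps)  <- the sharp cliff the paper labels collapse
    # n=12 -> ~1e-18  (4095 steps)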

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These limits may stem from the autoregressive nature of LLMs struggling with long-horizon state maintenance.
  • Architectures with explicit memory or search mechanisms could potentially raise the collapse thresholds.
  • The identified thresholds provide a quantitative way to compare reasoning capabilities across models.
  • Similar collapse phenomena might appear in other domains like code generation or theorem proving when complexity is scaled.

Load-bearing premise

The specific ways complexity is increased in each of the nine tasks accurately reflect greater reasoning demands without adding unrelated biases to the problem space.
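
To see what this premise requires, consider a sketch of one complexity knob of the kind the simulated rebuttal describes for SAT: clause count raised while variable count stays fixed. The generator below is hypothetical, not the paper's harness, and it exposes the exact worry the premise assumes away: raising the clause-to-variable ratio of random 3-SAT also pushes instances toward the satisfiability threshold (≈ 4.27 clauses per variable), a distributional shift unrelated to reasoning depth.

    import random

    def random_3sat(n_vars: int, ratio: float, seed: int = 0) -> list[tuple[int, ...]]:
        """Random 3-SAT as signed-integer clauses; complexity is controlled
        only through the clause-to-variable ratio."""
        rng = random.Random(seed)
        clauses = []
        for _ in range(int(ratio * n_vars)):
            variables = rng.sample(range(1, n_vars + 1), 3)  # three distinct variables
            clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
        return clauses

    # Same task semantics at every level; only the ratio (the "knob") changes.
    easy, hard = random_3sat(20, 2.0), random_3sat(20, 4.2)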

What would settle it

Observing no significant accuracy drop across low, medium, and high complexity versions of these tasks in a new model or evaluation would falsify the claim that reasoning collapse is a general property of current models.
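
One way to operationalize that test, following Figure 32's definition of a collapse threshold as the highest complexity level at which valid solutions are still reliably produced. The 50%-of-baseline cutoff below is an assumption for illustration, not the paper's definition; a model whose accuracy curve never crosses it within the tested range would count as evidence against collapse.

    def collapse_threshold(pass_rates: list[float]) -> int | None:
        """Given pass rates indexed by complexity level (low to high), return
        the last level before accuracy falls below half the level-0 baseline,
        or None if no such drop occurs."""
        baseline = pass_rates[0]
        for level, rate in enumerate(pass_rates):
            if rate < 0.5 * baseline:
                return level - 1
        return None  # no collapse observed in the tested range

    assert collapse_threshold([0.95, 0.90, 0.85, 0.30, 0.05]) == 2
    assert collapse_threshold([0.90, 0.88, 0.86, 0.85]) is None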

Figures

Figures reproduced from arXiv:2604.13371 by Akif Islam, Dipankar Das, Md. Fahad Ullah Utsho, Md. Golam Rashed, Mohd. Ruhul Ameen.

Figure 1. Tower of Hanoi setup with three pegs (A, B, C) and n disks; only the top disk may be moved, and no larger disk may be placed on a smaller disk.
Figure 2. Checker Jumping initial configuration for n red and n blue checkers on a (2n + 1)-cell board with a single empty space.
Figure 4. River Crossing: example goal configuration where all actors and agents are safely transported to the right bank.
Figure 5. Valid boat moves in River Crossing: the boat carries a non-empty subset of entities (up to capacity k) between banks while maintaining safety constraints.
Figure 6. Boolean Satisfiability (SAT): CNF formula structure with variables, literals, and clause interactions.
Figure 7. Cryptarithmetic (alphametic) puzzle: each letter must map to a unique digit, and the decoded arithmetic expression must hold exactly.
Figure 8. Graph Coloring example: each vertex receives a color from a fixed palette such that adjacent vertices do not share the same color.
Figure 9. Water Jug setup with two jugs of capacities c1 and c2, illustrating permissible operations (fill, empty, pour).
Figure 10. Sudoku grid structure: digits must satisfy row, column, and 3×3 subgrid uniqueness constraints.
Figure 11. Rubik's Cube representation and face-turn notation used to describe legal moves.
Figure 12. Prompt template for the Tower of Hanoi puzzle, including role specification, rule definitions, and required move-list output format.
Figure 14. Prompt structure for the River Crossing puzzle, emphasizing capacity limits and the jealous-husbands safety constraint.
Figure 16. Prompt template for Cryptarithmetic puzzles, outlining digit-uniqueness constraints, leading-zero constraints, and the mapping output format.
Figure 18. Prompt template for the Water Jug puzzle, including the fixed action space and structured move-list format.
Figure 19. Prompt template for Sudoku, using the 81-character string representation for puzzle input and output.
Figure 20. Prompt template for the 3×3 Rubik's Cube, using WCA notation for all legal face turns.
Figure 21. LLM task performance across Tower of Hanoi complexity levels.
Figure 22. LLM outcomes on the Water Jug task across increasing difficulty.
Figure 23. Model robustness on Boolean SAT across clause-complexity levels.
Figure 24. Performance trajectories on the Checker Jumping puzzle.
Figure 25. Graph Coloring pass/fail/collapse profiles across complexity levels.
Figure 26. LLM performance on River Crossing under escalating multi-agent constraints.
Figure 27. Rubik's Cube performance across permutation-complexity levels.
Figure 28. Sudoku performance, illustrating extreme sensitivity to symbolic-consistency violations.
Figure 30. Global mean pass rate as a function of complexity level.
Figure 31. Global error distribution per model, showing proportions of correct outputs, incorrect outputs, and collapses across all tasks and complexity levels.
Figure 32. Global collapse threshold summary showing, for each model, the highest complexity level L_k at which valid solutions are reliably produced across all nine reasoning tasks.
Figure 33. Global Failure Mode Map across tasks and difficulty levels.
Figure 34. Global mean-performance heatmap aggregating pass rates across all models and tasks.
Figure 35. Success Score Matrix showing normalized mean pass rates per task–model pair.
Figure 36. Aggregate model strength, measured …

Original abstract

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik's Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a controlled benchmarking framework using nine classical reasoning tasks (Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, Rubik's Cube) parameterized by increasing complexity levels while preserving semantics. It evaluates multiple open and proprietary Large Reasoning Models with deterministic validators, reporting high accuracy at low complexity but sharp degradation (often >50%) beyond task-specific thresholds, accompanied by inconsistent traces, constraint violations, and loss of state tracking. The authors formalize this as 'reasoning collapse' and argue that increased reasoning length does not reliably improve performance and that gains do not generalize across problem families, calling for evaluation methods beyond static benchmarks.

Significance. If the complexity controls are shown to isolate reasoning demand without confounds, the work would provide valuable empirical evidence of phase-transition-like limits in LLM reasoning robustness, shifting focus from aggregate benchmark scores to controlled scaling studies. The use of deterministic validators and a diverse task suite strengthens the empirical grounding, though the absence of detailed controls and statistics limits immediate generalizability to broader reasoning capabilities.

major comments (2)
  1. [Abstract] Abstract and described methodology: The claim that tasks are 'parameterized to precisely control complexity while preserving underlying semantics' does not specify how parameters (e.g., clause count in SAT, disk count in Hanoi, clue count in Sudoku) hold solution length or state-space cardinality fixed while varying only logical depth. This is load-bearing for the central 'reasoning collapse' claim, as the observed accuracy drops could arise from token-budget exhaustion or increased constraint-violation probability rather than any collapse in reasoning.
  2. [Results] Results and evaluation sections: No details are provided on sample sizes per complexity regime, number of trials, statistical tests for the phase-transition behavior, or full per-task/per-model accuracy tables. This makes the assertions of 'consistent' behavior and 'often exceeding 50%' declines unverifiable and weakens the cross-task generalization claim.
minor comments (2)
  1. [Abstract] The phrase 'phase transition like behavior' should be hyphenated as 'phase-transition-like' for clarity.
  2. [Introduction] The manuscript would benefit from explicit comparison to prior work on LLM reasoning limits (e.g., studies on chain-of-thought scaling or constraint satisfaction benchmarks) to better situate the novelty of the 'reasoning collapse' formalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which help clarify the presentation of our controlled benchmarking framework and strengthen the empirical claims. We address each major comment below and commit to revisions that enhance methodological transparency and statistical rigor without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract and described methodology: The claim that tasks are 'parameterized to precisely control complexity while preserving underlying semantics' does not specify how parameters (e.g., clause count in SAT, disk count in Hanoi, clue count in Sudoku) hold solution length or state-space cardinality fixed while varying only logical depth. This is load-bearing for the central 'reasoning collapse' claim, as the observed accuracy drops could arise from token-budget exhaustion or increased constraint-violation probability rather than any collapse in reasoning.

    Authors: We agree that explicit parameterization details are essential to substantiate the isolation of reasoning demand. The revised manuscript will expand the methodology section with a per-task table and description showing how parameters were selected: for SAT, clause-to-variable ratio is increased while holding variable count fixed (thus controlling state-space size); for Tower of Hanoi, disk count is varied but solution length grows predictably with the standard recursive strategy; for Sudoku, clue count is reduced while preserving unique solvability and minimal solution steps. We will explicitly note that complete decoupling of logical depth from state-space cardinality is not always feasible across all tasks and will add a limitations subsection discussing potential confounds such as token limits and validator strictness. These additions will allow readers to evaluate whether the observed drops reflect reasoning collapse rather than resource exhaustion. revision: yes

  2. Referee: [Results] Results and evaluation sections: No details are provided on sample sizes per complexity regime, number of trials, statistical tests for the phase-transition behavior, or full per-task/per-model accuracy tables. This makes the assertions of 'consistent' behavior and 'often exceeding 50%' declines unverifiable and weakens the cross-task generalization claim.

    Authors: We concur that the current presentation lacks sufficient statistical detail for full verifiability. In the revision we will add a new 'Evaluation Protocol' subsection specifying: 100 problem instances per complexity level per task, 3 independent trials per instance with temperature 0 for determinism where possible, and use of binomial confidence intervals plus paired t-tests to assess significance of accuracy drops across regimes. Complete per-task, per-model accuracy tables (including exact decline percentages) will be moved to the appendix, with summary statistics in the main results section. These changes will directly support the claims of consistent phase-transition behavior and cross-task patterns. revision: yes
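
As a concrete sketch of the interval computation this response commits to: the Wilson score interval is one standard choice for binomial pass rates (the rebuttal does not say which interval the authors will use), and the counts below are illustrative, matching the proposed 100 instances × 3 trials per level.

    import math

    def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
        if trials == 0:
            return (0.0, 1.0)
        p = successes / trials
        denom = 1 + z * z / trials
        centre = (p + z * z / (2 * trials)) / denom
        half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
        return (centre - half, centre + half)

    lo, hi = wilson_ci(42, 300)   # e.g., 42 valid solutions out of 300 attempts
    print(f"pass rate {42/300:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")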

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with explicit task parameterization

Full rationale

The paper conducts a controlled empirical study across nine fixed tasks with manually defined complexity parameters (e.g., clause count, disk count). No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear. 'Reasoning collapse' is introduced as a descriptive label for observed accuracy drops, not derived from prior equations or self-citations. Complexity controls are stated directly in the task construction and evaluated against external deterministic validators, keeping the chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about task representativeness and validator accuracy rather than free parameters or new entities; complexity controls are described as precise but not fitted to data.

axioms (2)
  • domain assumption: The nine selected classical reasoning tasks can be parameterized to control complexity while preserving underlying semantics and validity constraints.
    Invoked when constructing the benchmark suite and defining low/intermediate/high regimes.
  • domain assumption: Deterministic validators provide accurate, unbiased assessment of solution validity across all complexity levels.
    Used to filter outputs and ensure only fully valid solutions are counted as correct.
invented entities (1)
  • reasoning collapse (no independent evidence)
    purpose: To label the observed sharp performance degradation and associated failure modes as complexity increases.
    Introduced as a formalization of the empirical pattern; no independent evidence outside the reported observations is provided.

pith-pipeline@v0.9.0 · 5579 in / 1371 out tokens · 41994 ms · 2026-05-10T14:08:54.888642+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 15 canonical work pages · 7 internal anchors


  1. [2] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in Large Language Models, 2022. arXiv:2201.11903
  2. [3] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are zero-shot reasoners, 2022. arXiv:2205.11916
  3. [4] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2022. arXiv:2203.11171
  4. [5] Aarohi Srivastava, Abhinav Rastogi, Abhishek Sharma, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. arXiv:2206.04615
  5. [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, and Reiichiro Nakano. Training verifiers to solve math word problems, 2021. arXiv:2110.14168
  6. [7] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021. arXiv:2103.03874
  7. [8] Yonatan Oren, Nicole Meister, Nitish Gupta, Tuhin Chakrabarty, Jonathan Valter, Thomas Steinke, Matt Tancik, Danqi Chen, Percy Liang, Sergey Levine, et al. Proving test set contamination in black-box language models, 2023. arXiv:2310.17623
  8. [9] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. Apple Machine Learning Research, 2025. URL https://machinelearning.apple.com/research/illusion-of-thinking
  9. [11] Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, and Min Yang. A survey on Large Language Model benchmarks, 2025. arXiv:2508.15361
  10. [13] Andrew M. Bean, Ryan Othniel Kearns, et al. Measuring what matters: Construct validity in large language model benchmarks. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 Datasets and Benchmarks Track
  11. [14] Nearchos Potamitis, Lars Klein, and Akhil Arora. ReasonBENCH: Benchmarking the (in)stability of LLM reasoning, 2025. arXiv:2512.07795
  12. [15] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. DyVal: Dynamic evaluation of large language models for reasoning tasks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjfOL9z5Xr
  13. [16] Jiayi Gui, Yiming Liu, Jiale Cheng, et al. Benchmarking rule-based reasoning abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2025, 2025. URL https://aclanthology.org/2025.findings-acl.77
  14. [18] Alibaba Cloud. Alibaba Cloud Model Studio: Qwen Plus, 2025. URL https://www.alibabacloud.com/help/en/model-studio/models
  15. [19] Moonshot AI. Kimi K2: Open agentic intelligence, 2025. URL https://github.com/MoonshotAI/Kimi-K2.5
  16. [20] Anthropic. Claude 3.7 Sonnet system card. Technical report, Anthropic, 2025
  17. [21] Google DeepMind. Gemini 3 Pro preview documentation, 2026. URL https://deepmind.google/technologies/gemini/
  18. [22] OpenAI. GPT-5 architecture and capabilities, 2026. URL https://openai.com/research/
  19. [23] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open Large Language Models, 2025. arXiv:2512.02556
  20. [24] Anonymous. A survey on evaluation of Large Language Models, 2023. arXiv:2307.03109
  21. [25] Anonymous. Construct validity in Large Language Model benchmarks, 2025. NeurIPS 2025 Datasets and Benchmarks Track
  22. [26] K. Zhu et al. DyVal: Dynamic evaluation of Large Language Models for controlled complexity, 2025. OpenReview: DyVal framework
  23. [27] J. Gui et al. Benchmarking rule-based reasoning abilities of Large Language Models. In Findings of the Association for Computational Linguistics (ACL), 2025
  24. [28] S. Saha et al. Learning to plan & reason for evaluation with thinking LLM-as-a-Judge, 2025. arXiv:2501.18099
  25. [29] Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of Large Language Models, 2024. arXiv:2404.01869
  26. [30] Sourav Banerjee, Ayushi Agarwal, and Eishkaran Singh. The vulnerability of language model benchmarks: Do they accurately reflect true LLM performance?, 2024. arXiv:2412.03597
  27. [31] Mistral AI. Magistral: Mistral's first reasoning model, 2025. arXiv:2506.10910
