pith · machine review for the scientific record

arXiv:2604.13371 · v1 · submitted 2026-04-15 · 💻 cs.CL

Recognition: unknown

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:08 UTC · model grok-4.3

classification: 💻 cs.CL
keywords: reasoning collapse · large language models · complexity thresholds · reasoning tasks · phase transition · benchmarking · validity constraints · discrete problems

The pith

Large language models undergo reasoning collapse beyond task-specific complexity thresholds in discrete reasoning problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests large reasoning models on nine classical tasks, including Sudoku and the Tower of Hanoi, with complexity increased gradually through parameterization. Using strict validators that accept only complete, valid solutions, the evaluation shows strong performance on easy instances but sharp declines past task-specific points. The degradation appears consistently across models and tasks, with failure modes such as constraint violations and loss of state tracking. These patterns indicate that aggregate accuracy scores hide fundamental robustness issues as problem difficulty grows. The authors argue this calls for evaluation approaches focused on complexity scaling rather than fixed benchmarks.

Core claim

The authors establish that LLMs display consistent phase-transition-like behavior on these tasks: high accuracy in low-complexity regimes gives way to sharp degradation beyond task-specific thresholds, a phenomenon they term reasoning collapse, marked by accuracy drops often exceeding 50 percent along with invalid and inconsistent outputs.

What carries the argument

A suite of nine parameterized classical reasoning tasks equipped with deterministic validators that enforce explicit validity constraints on solutions.
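
To make "deterministic validator" concrete, here is a minimal sketch for one task, Tower of Hanoi, written against the rules in the paper's Figure 1 caption (only the top disk moves; no larger disk on a smaller one). It illustrates the all-or-nothing acceptance the review describes; it is not the authors' code, and the move-list format is an assumption.

    def validate_hanoi(n: int, moves: list[tuple[str, str]]) -> bool:
        """Accept a move list only if every move is legal and the final
        state has all n disks on peg C. Disks are numbered 1 (smallest) to n."""
        pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # bottom -> top
        for src, dst in moves:
            if src not in pegs or dst not in pegs or not pegs[src]:
                return False                       # malformed or empty-source move
            disk = pegs[src][-1]                   # only the top disk may move
            if pegs[dst] and pegs[dst][-1] < disk:
                return False                       # larger disk onto smaller: invalid
            pegs[dst].append(pegs[src].pop())
        return pegs["C"] == list(range(n, 0, -1))  # only complete solutions pass

    # A partial or "almost right" transcript scores zero: validation is all-or-nothing.
    assert validate_hanoi(2, [("A", "B"), ("A", "C"), ("B", "C")])
    assert not validate_hanoi(2, [("A", "C"), ("A", "C")])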

If this is right

  • Accuracy declines often exceed 50% past the thresholds (see the compounding sketch after this list).
  • Longer reasoning traces do not reliably increase correctness.
  • Gains on one problem family fail to generalize to others.
  • Current models lose state tracking and produce confidently incorrect outputs at high complexity.
  • Static benchmarks are insufficient for measuring reasoning robustness.
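
The paper's Tower of Hanoi discussion gives the first two bullets a quantitative backbone: if each step is correct with probability p, success on a task whose optimal solution has L_min(n) steps scales as P_success ≈ p^L_min(n), and for Hanoi L_min(n) = 2^n − 1, so each added disk doubles the trajectory the model must keep error-free. A minimal sketch of that decay (the p = 0.99 figure is illustrative, not measured):

    def p_success(p: float, n: int) -> float:
        """Compounding-error model: success requires every one of the
        L_min(n) = 2**n - 1 optimal Tower of Hanoi moves to be correct."""
        return p ** (2 ** n - 1)

    for n in (3, 6, 9, 12):
        print(n, p_success(0.99, n))
    # n=3  -> ~0.93   (7 steps)
    # n=6  -> ~0.53   (63 steps)
    # n=9  -> ~0.006  (511 steps)  <- the sharp cliff the paper labels collapse
    # n=12 -> ~1e-18  (4095 steps)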

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These limits may stem from the autoregressive nature of LLMs struggling with long-horizon state maintenance.
  • Architectures with explicit memory or search mechanisms could potentially raise the collapse thresholds.
  • The identified thresholds provide a quantitative way to compare reasoning capabilities across models.
  • Similar collapse phenomena might appear in other domains like code generation or theorem proving when complexity is scaled.

Load-bearing premise

The specific ways complexity is increased in each of the nine tasks accurately reflect greater reasoning demands without adding unrelated biases to the problem space.
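
To see what this premise requires, consider a sketch of one complexity knob of the kind the simulated rebuttal describes for SAT: clause count raised while variable count stays fixed. The generator below is hypothetical, not the paper's harness, and it exposes the exact worry the premise assumes away: raising the clause-to-variable ratio of random 3-SAT also pushes instances toward the satisfiability threshold (≈ 4.27 clauses per variable), a distributional shift unrelated to reasoning depth.

    import random

    def random_3sat(n_vars: int, ratio: float, seed: int = 0) -> list[tuple[int, ...]]:
        """Random 3-SAT as signed-integer clauses; complexity is controlled
        only through the clause-to-variable ratio."""
        rng = random.Random(seed)
        clauses = []
        for _ in range(int(ratio * n_vars)):
            variables = rng.sample(range(1, n_vars + 1), 3)  # three distinct variables
            clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
        return clauses

    # Same task semantics at every level; only the ratio (the "knob") changes.
    easy, hard = random_3sat(20, 2.0), random_3sat(20, 4.2)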

What would settle it

Observing no significant accuracy drop across low, medium, and high complexity versions of these tasks in a new model or evaluation would falsify the claim that reasoning collapse is a general property of current models.
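
One way to operationalize that test, following Figure 32's definition of a collapse threshold as the highest complexity level at which valid solutions are still reliably produced. The 50%-of-baseline cutoff below is an assumption for illustration, not the paper's definition; a model whose accuracy curve never crosses it within the tested range would count as evidence against collapse.

    def collapse_threshold(pass_rates: list[float]) -> int | None:
        """Given pass rates indexed by complexity level (low to high), return
        the last level before accuracy falls below half the level-0 baseline,
        or None if no such drop occurs."""
        baseline = pass_rates[0]
        for level, rate in enumerate(pass_rates):
            if rate < 0.5 * baseline:
                return level - 1
        return None  # no collapse observed in the tested range

    assert collapse_threshold([0.95, 0.90, 0.85, 0.30, 0.05]) == 2
    assert collapse_threshold([0.90, 0.88, 0.86, 0.85]) is None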

Figures

Figures reproduced from arXiv:2604.13371 by Akif Islam, Dipankar Das, Md. Fahad Ullah Utsho, Md. Golam Rashed, Mohd. Ruhul Ameen.

Figure 1. Tower of Hanoi setup with three pegs (A, B, C) and n disks; only the top disk may be moved, and no larger disk may be placed on a smaller disk.
Figure 2. Checker Jumping initial configuration for n red and n blue checkers on a (2n + 1)-cell board with a single empty space.
Figure 4. River Crossing: example goal configuration where all actors and agents are safely transported to the right bank.
Figure 5. Valid boat moves in River Crossing: the boat carries a non-empty subset of entities (up to capacity k) between banks while maintaining safety constraints.
Figure 6. Boolean Satisfiability (SAT): CNF formula structure with variables, literals, and clause interactions.
Figure 7. Cryptarithmetic (alphametic) puzzle: each letter must map to a unique digit, and the decoded arithmetic expression must hold exactly.
Figure 8. Graph Coloring example: each vertex receives a color from a fixed palette such that adjacent vertices do not share the same color.
Figure 9. Water Jug setup with two jugs of capacities c1 and c2, illustrating permissible operations (fill, empty, pour).
Figure 10. Sudoku grid structure: digits must satisfy row, column, and 3×3 subgrid uniqueness constraints.
Figure 11. Rubik's Cube representation and face-turn notation used to describe legal moves.
Figure 12. Prompt template for the Tower of Hanoi puzzle, including role specification, rule definitions, and required move-list output format.
Figure 14. Prompt structure for the River Crossing puzzle, emphasizing capacity limits and the jealous-husbands safety constraint.
Figure 16. Prompt template for Cryptarithmetic puzzles, outlining digit-uniqueness constraints, leading-zero constraints, and the mapping output format.
Figure 18. Prompt template for the Water Jug puzzle, including the fixed action space and structured move-list format.
Figure 19. Prompt template for Sudoku, using the 81-character string representation for puzzle input and output.
Figure 20. Prompt template for the 3×3 Rubik's Cube, using WCA notation for all legal face turns.
Figure 21. LLM task performance across Tower of Hanoi complexity levels.
Figure 22. LLM outcomes on the Water Jug task across increasing difficulty.
Figure 23. Model robustness on Boolean SAT across clause-complexity levels.
Figure 24. Performance trajectories on the Checker Jumping puzzle.
Figure 25. Graph Coloring pass/fail/collapse profiles across complexity levels.
Figure 26. LLM performance on River Crossing under escalating multi-agent constraints.
Figure 27. Rubik's Cube performance across permutation-complexity levels.
Figure 28. Sudoku performance, illustrating extreme sensitivity to symbolic-consistency violations.
Figure 30. Global mean pass rate as a function of complexity level.
Figure 31. Global error distribution per model, showing proportions of correct outputs, incorrect outputs, and collapses across all tasks and complexity levels.
Figure 32. Global collapse threshold summary showing, for each model, the highest complexity level L_k at which valid solutions are reliably produced across all nine reasoning tasks.
Figure 33. Global Failure Mode Map across tasks and difficulty levels.
Figure 34. Global mean-performance heatmap aggregating pass rates across all models and tasks.
Figure 35. Success Score Matrix showing normalized mean pass rates per task–model pair.
Figure 36. Aggregate model strength, measured …

Original abstract

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik's Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a controlled benchmarking framework using nine classical reasoning tasks (Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, Rubik's Cube) parameterized by increasing complexity levels while preserving semantics. It evaluates multiple open and proprietary Large Reasoning Models with deterministic validators, reporting high accuracy at low complexity but sharp degradation (often >50%) beyond task-specific thresholds, accompanied by inconsistent traces, constraint violations, and loss of state tracking. The authors formalize this as 'reasoning collapse' and argue that increased reasoning length does not reliably improve performance and that gains do not generalize across problem families, calling for evaluation methods beyond static benchmarks.

Significance. If the complexity controls are shown to isolate reasoning demand without confounds, the work would provide valuable empirical evidence of phase-transition-like limits in LLM reasoning robustness, shifting focus from aggregate benchmark scores to controlled scaling studies. The use of deterministic validators and a diverse task suite strengthens the empirical grounding, though the absence of detailed controls and statistics limits immediate generalizability to broader reasoning capabilities.

major comments (2)
  1. [Abstract] Abstract and described methodology: The claim that tasks are 'parameterized to precisely control complexity while preserving underlying semantics' does not specify how parameters (e.g., clause count in SAT, disk count in Hanoi, clue count in Sudoku) hold solution length or state-space cardinality fixed while varying only logical depth. This is load-bearing for the central 'reasoning collapse' claim, as the observed accuracy drops could arise from token-budget exhaustion or increased constraint-violation probability rather than any collapse in reasoning.
  2. [Results] Results and evaluation sections: No details are provided on sample sizes per complexity regime, number of trials, statistical tests for the phase-transition behavior, or full per-task/per-model accuracy tables. This makes the assertions of 'consistent' behavior and 'often exceeding 50%' declines unverifiable and weakens the cross-task generalization claim.
minor comments (2)
  1. [Abstract] The phrase 'phase transition like behavior' should be hyphenated as 'phase-transition-like' for clarity.
  2. [Introduction] The manuscript would benefit from explicit comparison to prior work on LLM reasoning limits (e.g., studies on chain-of-thought scaling or constraint satisfaction benchmarks) to better situate the novelty of the 'reasoning collapse' formalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which help clarify the presentation of our controlled benchmarking framework and strengthen the empirical claims. We address each major comment below and commit to revisions that enhance methodological transparency and statistical rigor without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract and described methodology: The claim that tasks are 'parameterized to precisely control complexity while preserving underlying semantics' does not specify how parameters (e.g., clause count in SAT, disk count in Hanoi, clue count in Sudoku) hold solution length or state-space cardinality fixed while varying only logical depth. This is load-bearing for the central 'reasoning collapse' claim, as the observed accuracy drops could arise from token-budget exhaustion or increased constraint-violation probability rather than any collapse in reasoning.

    Authors: We agree that explicit parameterization details are essential to substantiate the isolation of reasoning demand. The revised manuscript will expand the methodology section with a per-task table and description showing how parameters were selected: for SAT, clause-to-variable ratio is increased while holding variable count fixed (thus controlling state-space size); for Tower of Hanoi, disk count is varied but solution length grows predictably with the standard recursive strategy; for Sudoku, clue count is reduced while preserving unique solvability and minimal solution steps. We will explicitly note that complete decoupling of logical depth from state-space cardinality is not always feasible across all tasks and will add a limitations subsection discussing potential confounds such as token limits and validator strictness. These additions will allow readers to evaluate whether the observed drops reflect reasoning collapse rather than resource exhaustion. revision: yes

  2. Referee: [Results] Results and evaluation sections: No details are provided on sample sizes per complexity regime, number of trials, statistical tests for the phase-transition behavior, or full per-task/per-model accuracy tables. This makes the assertions of 'consistent' behavior and 'often exceeding 50%' declines unverifiable and weakens the cross-task generalization claim.

    Authors: We concur that the current presentation lacks sufficient statistical detail for full verifiability. In the revision we will add a new 'Evaluation Protocol' subsection specifying: 100 problem instances per complexity level per task, 3 independent trials per instance with temperature 0 for determinism where possible, and use of binomial confidence intervals plus paired t-tests to assess significance of accuracy drops across regimes. Complete per-task, per-model accuracy tables (including exact decline percentages) will be moved to the appendix, with summary statistics in the main results section. These changes will directly support the claims of consistent phase-transition behavior and cross-task patterns. revision: yes
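
As a concrete sketch of the interval computation this response commits to: the Wilson score interval is one standard choice for binomial pass rates (the rebuttal does not say which interval the authors will use), and the counts below are illustrative, matching the proposed 100 instances × 3 trials per level.

    import math

    def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
        if trials == 0:
            return (0.0, 1.0)
        p = successes / trials
        denom = 1 + z * z / trials
        centre = (p + z * z / (2 * trials)) / denom
        half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
        return (centre - half, centre + half)

    lo, hi = wilson_ci(42, 300)   # e.g., 42 valid solutions out of 300 attempts
    print(f"pass rate {42/300:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")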

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with explicit task parameterization

Full rationale

The paper conducts a controlled empirical study across nine fixed tasks with manually defined complexity parameters (e.g., clause count, disk count). No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear. 'Reasoning collapse' is introduced as a descriptive label for observed accuracy drops, not derived from prior equations or self-citations. Complexity controls are stated directly in the task construction and evaluated against external deterministic validators, keeping the chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about task representativeness and validator accuracy rather than free parameters or new entities; complexity controls are described as precise but not fitted to data.

axioms (2)
  • domain assumption: The nine selected classical reasoning tasks can be parameterized to control complexity while preserving underlying semantics and validity constraints.
    Invoked when constructing the benchmark suite and defining low/intermediate/high regimes.
  • domain assumption: Deterministic validators provide accurate, unbiased assessment of solution validity across all complexity levels.
    Used to filter outputs and ensure only fully valid solutions are counted as correct.
invented entities (1)
  • reasoning collapse (no independent evidence)
    purpose: To label the observed sharp performance degradation and associated failure modes as complexity increases.
    Introduced as a formalization of the empirical pattern; no independent evidence outside the reported observations is provided.

pith-pipeline@v0.9.0 · 5579 in / 1371 out tokens · 41994 ms · 2026-05-10T14:08:54.888642+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 15 canonical work pages · 7 internal anchors


  1. [2] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in Large Language Models, 2022. arXiv:2201.11903
  2. [3] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are zero-shot reasoners, 2022. arXiv:2205.11916
  3. [4] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2022. arXiv:2203.11171
  4. [5] Aarohi Srivastava, Abhinav Rastogi, Abhishek Sharma, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. arXiv:2206.04615
  5. [6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, and Reiichiro Nakano. Training verifiers to solve math word problems, 2021. arXiv:2110.14168
  6. [7] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021. arXiv:2103.03874
  7. [8] Yonatan Oren, Nicole Meister, Nitish Gupta, Tuhin Chakrabarty, Jonathan Valter, Thomas Steinke, Matt Tancik, Danqi Chen, Percy Liang, Sergey Levine, et al. Proving test set contamination in black-box language models, 2023. arXiv:2310.17623
  8. [9] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. Apple Machine Learning Research, 2025. URL https://machinelearning.apple.com/research/illusion-of-thinking
  9. [11] Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, and Min Yang. A survey on Large Language Model benchmarks, 2025. arXiv:2508.15361
  10. [13] Andrew M. Bean, Ryan Othniel Kearns, et al. Measuring what matters: Construct validity in large language model benchmarks. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 Datasets and Benchmarks Track
  11. [14] Nearchos Potamitis, Lars Klein, and Akhil Arora. ReasonBENCH: Benchmarking the (in)stability of LLM reasoning, 2025. arXiv:2512.07795
  12. [15] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. DyVal: Dynamic evaluation of large language models for reasoning tasks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjfOL9z5Xr
  13. [16] Jiayi Gui, Yiming Liu, Jiale Cheng, et al. Benchmarking rule-based reasoning abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2025, 2025. URL https://aclanthology.org/2025.findings-acl.77
  14. [18] Alibaba Cloud. Alibaba Cloud Model Studio: Qwen Plus, 2025. URL https://www.alibabacloud.com/help/en/model-studio/models
  15. [19] Moonshot AI. Kimi K2: Open agentic intelligence, 2025. URL https://github.com/MoonshotAI/Kimi-K2.5
  16. [20] Anthropic. Claude 3.7 Sonnet system card. Technical report, Anthropic, 2025
  17. [21] Google DeepMind. Gemini 3 Pro preview documentation, 2026. URL https://deepmind.google/technologies/gemini/
  18. [22] OpenAI. GPT-5 architecture and capabilities, 2026. URL https://openai.com/research/
  19. [23] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open Large Language Models, 2025. arXiv:2512.02556
  20. [24] Anonymous. A survey on evaluation of Large Language Models, 2023. arXiv:2307.03109
  21. [25] Anonymous. Construct validity in Large Language Model benchmarks, 2025. NeurIPS 2025 Datasets and Benchmarks Track
  22. [26] K. Zhu et al. DyVal: Dynamic evaluation of Large Language Models for controlled complexity, 2025. OpenReview: DyVal framework
  23. [27] J. Gui et al. Benchmarking rule-based reasoning abilities of Large Language Models. In Findings of the Association for Computational Linguistics (ACL), 2025
  24. [28] S. Saha et al. Learning to plan & reason for evaluation with thinking LLM-as-a-Judge, 2025. arXiv:2501.18099
  25. [29] Philipp Mondorf and Barbara Plank. Beyond accuracy: Evaluating the reasoning behavior of Large Language Models, 2024. arXiv:2404.01869
  26. [30] Sourav Banerjee, Ayushi Agarwal, and Eishkaran Singh. The vulnerability of language model benchmarks: Do they accurately reflect true LLM performance?, 2024. arXiv:2412.03597
  27. [31] Mistral AI. Magistral: Mistral's first reasoning model, 2025. arXiv:2506.10910
