pith. sign in

arxiv: 2605.06638 · v3 · pith:WFMWIHAPnew · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Pith reviewed 2026-05-20 22:32 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords reinforcement learninglarge language modelslong-horizon reasoninglogical expressivenessscaling lawssynthetic benchmarksdownstream transferautoregressive transformers
0
0 comments X

The pith

Reinforcement learning can overcome long-horizon reasoning limits in LLMs when trained on sufficiently expressive logics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether reinforcement learning can equip large language models with the ability to manage extended sequences of logical steps. It introduces ScaleLogic, a controlled synthetic setting that lets researchers adjust the length of the required reasoning chain separately from the richness of the logical operations available. Results show that the compute needed to train effectively grows according to a power law as reasoning depth increases, and the steepness of that growth rises when the logic includes conjunction, disjunction, negation, and quantification. Training under the more expressive conditions produces larger improvements on mathematics and general reasoning benchmarks along with more efficient transfer of the acquired skills. These patterns indicate that observed difficulties with long reasoning chains are not fixed properties of the autoregressive architecture but can be mitigated through choices in training data and procedure.

Core claim

ScaleLogic provides independent control over reasoning depth and logical expressiveness, spanning from simple implication to first-order logic with and, or, not, and for-all. RL training compute scales as a power law with depth whose exponent increases monotonically from 1.04 to 2.60 as expressiveness grows. Higher-expressiveness training yields gains up to 10.66 points on downstream benchmarks and more compute-efficient transfer, while the power-law relation holds across RL methods and curriculum training improves scaling efficiency. The results establish that LLM shortcomings in long-horizon reasoning are not fundamental to the transformer architecture and can be addressed by improved data

What carries the argument

ScaleLogic, a synthetic logical reasoning environment that independently varies proof-planning depth from the expressiveness of the underlying logic (simple implication to full first-order with conjunction, disjunction, negation, and universal quantification).

If this is right

  • Training compute follows a power law with reasoning depth across multiple RL methods.
  • More expressive logical training produces larger performance gains on mathematics and general reasoning benchmarks.
  • Curriculum-based training improves scaling efficiency for greater reasoning depths.
  • The character of the training data, not merely its volume, determines how effectively skills transfer to new tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthetic scaling patterns hold outside the lab, curricula that gradually increase logical expressiveness could optimize compute allocation for complex sequential problems.
  • The same controlled-variation approach could be applied to study scaling in adjacent areas such as multi-step planning or program synthesis.
  • Prioritizing richer logical structures during data construction might lessen reliance on architectural modifications for stronger long-horizon performance.

Load-bearing premise

Performance improvements measured inside the controlled synthetic ScaleLogic tasks will transfer to and predict gains on real-world long-horizon reasoning problems.

What would settle it

Apply the same high-expressiveness RL training to a held-out collection of multi-step real-world tasks such as complex math word problems or planning scenarios and observe whether downstream accuracy fails to exceed that of low-expressiveness training.

Figures

Figures reproduced from arXiv: 2605.06638 by Abulhair Saparov, Guangchen Lan, Guanwen Qiu, Sipeng Zhang, Tianle Wang, Xinpeng Wei, Zhaoyang Wang.

Figure 1
Figure 1. Figure 1: Overview of SCALELOGIC. Each problem has B candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The depth D controls proof depth. Left: Implication-only reasoning. Right: The most expressive logic setting (referred to as + Quantification in Section 3.2) combines conjunction, disjunction, negation, and universal quantification. 2 Rela… view at source ↗
Figure 2
Figure 2. Figure 2: Training cost scales as a power law with reasoning depth, with exponent γ governed by logical expressiveness. (a) Scatter points show the training steps T required to reach the 90% accuracy threshold as a function of reasoning depth D. Solid lines show power-law fits T ∝ D γ . (b) Fitted exponent γ increases monotonically when expressiveness increases from Implication-only (γ = 1.04) to + Quantification (γ… view at source ↗
Figure 3
Figure 3. Figure 3: Downstream transfer from synthetic reasoning training. (a) All settings outperform the base model, with richer logical settings producing larger and more sustained gains. (b) Average downstream performance across logical settings under fixed depth (D = 12) and fixed compute (∼ 100 steps). Both controls exhibit the same monotone trend: More expressive training settings yield stronger downstream performance.… view at source ↗
Figure 4
Figure 4. Figure 4: Effect of training distribution and RL algorithm on scaling efficiency in the + Conjunction setting. Each curve aggregates three independent seeds; shading denotes ±1 standard deviation in log-space across seeds. (a) Curriculum training yields the lowest exponent (γ = 1.33), whereas difficult-only training yields the highest (γ = 2.36). (b) All RL algorithms follow power-law scaling (R 2 > 0.99), with expo… view at source ↗
Figure 5
Figure 5. Figure 5: Out-of-distribution generalization across reasoning depths. (a) Increasing the training depth consistently extends the range of solvable test depths. (b) OOD generalization remains bounded: even the models trained at the largest depths fall to random at Dtest/Dtrain ≈3. additional algorithms: base GRPO [Shao et al., 2024] (without DAPO extension) and GSPO [Zheng et al., 2025], a recent sequence-level polic… view at source ↗
Figure 6
Figure 6. Figure 6: Power-law (orange) vs. exponential (blue) fits for each expressiveness setting. view at source ↗
Figure 6
Figure 6. Figure 6: Power-law (orange) vs. exponential (blue) fits for each expressiveness setting. [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Power-law scaling under 85% accuracy threshold (cf. main-text view at source ↗
Figure 7
Figure 7. Figure 7: Power-law scaling under 85% accuracy threshold (cf. main-text [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Power-law fits T(D) = a · D γ under four alternative compute measures, complementing view at source ↗
Figure 8
Figure 8. Figure 8: Power-law fits T(D) = a · D γ under four alternative compute measures, complementing [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Effect of candidate count B at fixed depth D = 8 under the + Quantification setting. (a) Training steps to 90% accuracy follow a power law in B (γB = 1.41, R 2 = 0.984, ∆AIC = +7.0 vs. exponential). (b) Average downstream performance across the eight reasoning benchmarks of Section 4.3, plotted against candidate count B. Gains saturate quickly: performance rises from 52.5% at B = 2 to 55.7% at B = 4 (+3.2 … view at source ↗
Figure 9
Figure 9. Figure 9: Effect of candidate count B at fixed depth D = 8 under the + Quantification setting. (a) Training steps to 90% accuracy follow a power law in B (γB = 1.41, R 2 = 0.984, ∆AIC = +7.0 vs. exponential). (b) Average downstream performance across the eight reasoning benchmarks of Section 4.3, plotted against candidate count B. Gains saturate quickly: performance rises from 52.5% at B = 2 to 55.7% at B = 4 (+3.2 … view at source ↗
Figure 10
Figure 10. Figure 10: Uniform vs. curriculum training under the most expressive view at source ↗
Figure 10
Figure 10. Figure 10: Uniform vs. curriculum training under the most expressive [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Training trajectories on + Conjunction under three data distributions. Each row: response length (left), entropy (middle), validation accuracy (right); legend (training depth D) is shared across the three sub-panels. Note that axis ranges may differ across rows to improve readability. The scaling behavior at 8B closely mirrors that at 4B. All five expressiveness settings continue to follow a clean power l… view at source ↗
Figure 11
Figure 11. Figure 11: Training trajectories on + Conjunction under three data distributions. Each row: response length (left), entropy (middle), validation accuracy (right); legend (training depth D) is shared across the three sub-panels. Note that axis ranges may differ across rows to improve readability. The scaling behavior at 8B closely mirrors that at 4B. All five expressiveness settings continue to follow a clean power l… view at source ↗
Figure 12
Figure 12. Figure 12: Cross-scale replication on Qwen3-8B. (a) Training steps to convergence vs. reasoning depth across five expressiveness levels on log-log axes. Solid lines show power-law fits T ∝ D γ . (b) Fitted γ increases monotonically when expressiveness increases from Implication-only (γ = 0.99) to the + Quantification setting (γ = 2.53), mirroring the 4B picture in view at source ↗
Figure 12
Figure 12. Figure 12: Cross-scale replication on Qwen3-8B. (a) Training steps to convergence vs. reasoning depth across five expressiveness levels on log-log axes. Solid lines show power-law fits T ∝ D γ . (b) Fitted γ increases monotonically when expressiveness increases from Implication-only (γ = 0.99) to the + Quantification setting (γ = 2.53), mirroring the 4B picture in [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Training trajectories for all five settings. Each row: response length (left), entropy (middle), validation view at source ↗
Figure 13
Figure 13. Figure 13: Training trajectories for all five settings. Each row: response length (left), entropy (middle), validation [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Number of correct answers out of 30 for frontier LLMs on view at source ↗
Figure 14
Figure 14. Figure 14: Number of correct answers out of 30 for frontier LLMs on [PITH_FULL_IMAGE:figures/full_fig_p031_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: MATH-500 #80: full reasoning trace of our model (top, view at source ↗
Figure 15
Figure 15. Figure 15: MATH-500 #80: full reasoning trace of our model (top, [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗
read the original abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that they are fundamental to the autoregressive transformer architecture. To address this, we introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ScaleLogic, a synthetic logical reasoning framework that independently varies proof depth (horizon) and logical expressiveness (from implication-only to full FOL with quantifiers). It reports that RL training compute T scales as a power law with depth D (T ∝ D^γ, R² > 0.99), that the exponent γ rises monotonically from 1.04 to 2.60 with increasing expressiveness, and that more expressive training yields larger downstream gains (up to +10.66 points) and better compute-efficient transfer on mathematics and general reasoning benchmarks. The central claim is that long-horizon reasoning deficits are not fundamental to autoregressive transformers and can be addressed through improved training methodology and data; the power-law relationship is shown to hold across multiple RL methods and to improve under curriculum training.

Significance. If the empirical scaling laws and transfer results hold under rigorous controls, the work would provide a valuable controlled testbed for studying how training data expressiveness affects long-horizon reasoning scaling, offering evidence against purely architectural explanations of LLM limitations. The reproducible power-law fits across RL methods and the curriculum improvement are concrete strengths that could inform future data-curation strategies.

major comments (3)
  1. [Abstract / Results] Abstract and methods (inferred from reported fits): the power-law claim T ∝ D^γ with R² > 0.99 is presented without specifying the number of depth points fitted, the regression procedure (ordinary least squares on log-log or otherwise), error bars on γ, or controls for confounding factors such as total tokens or model size; this makes it impossible to assess whether the reported monotonic increase in γ with expressiveness is robust or sensitive to fitting choices.
  2. [Abstract] Abstract: the downstream transfer gains of up to +10.66 points and 'more compute-efficient transfer' are reported without naming the exact benchmarks, the base model sizes, the number of runs, or any statistical significance tests; without these details the claim that expressiveness shapes transfer more than compute volume cannot be evaluated for load-bearing status.
  3. [Abstract / Conclusion] Abstract / Discussion: the conclusion that shortcomings 'are not fundamental to the underlying architecture' rests on transfer from ScaleLogic to real-world mathematics and reasoning benchmarks, yet the manuscript supplies no analysis of how closely those benchmarks match the controlled depth and expressiveness axes or whether gains survive distribution shift to noisy, open-ended problems.
minor comments (2)
  1. [Abstract] Notation: the scaling exponent is introduced as γ but later referenced in the abstract as both γ and the 'scaling exponent'; consistent symbol use would improve readability.
  2. [Abstract] The abstract states that the power-law 'holds across multiple RL methods' but does not list which methods were tested; adding this enumeration would clarify the generality claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important points of clarification that will improve the accessibility and rigor of the presentation. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and methods (inferred from reported fits): the power-law claim T ∝ D^γ with R² > 0.99 is presented without specifying the number of depth points fitted, the regression procedure (ordinary least squares on log-log or otherwise), error bars on γ, or controls for confounding factors such as total tokens or model size; this makes it impossible to assess whether the reported monotonic increase in γ with expressiveness is robust or sensitive to fitting choices.

    Authors: We agree that the abstract would benefit from these details to allow readers to evaluate the robustness of the reported scaling. The regressions were performed via ordinary least squares on log-log coordinates using the depth points shown in the main figures; model size was held fixed and total training tokens were controlled by adjusting episode count per depth. In the revision we will state the number of fitted points, the regression method, and the controls directly in the abstract, add error bars on γ to the relevant plots, and expand the Methods section with the precise fitting procedure and any sensitivity checks. revision: yes

  2. Referee: [Abstract] Abstract: the downstream transfer gains of up to +10.66 points and 'more compute-efficient transfer' are reported without naming the exact benchmarks, the base model sizes, the number of runs, or any statistical significance tests; without these details the claim that expressiveness shapes transfer more than compute volume cannot be evaluated for load-bearing status.

    Authors: The referee is correct that the abstract is too terse on these points. The full manuscript already reports the specific benchmarks, model sizes, and run counts in the experimental sections. To address the concern we will explicitly name the benchmarks, state the base model sizes, indicate the number of runs, and report statistical significance in the revised abstract so that the transfer claims can be assessed without reference to later sections. revision: yes

  3. Referee: [Abstract / Conclusion] Abstract / Discussion: the conclusion that shortcomings 'are not fundamental to the underlying architecture' rests on transfer from ScaleLogic to real-world mathematics and reasoning benchmarks, yet the manuscript supplies no analysis of how closely those benchmarks match the controlled depth and expressiveness axes or whether gains survive distribution shift to noisy, open-ended problems.

    Authors: We accept that an explicit bridging analysis would strengthen the interpretation. In the revision we will add a short discussion subsection that maps the logical depth and expressiveness present in the downstream benchmarks onto the ScaleLogic axes and comments on the degree of distribution shift. We will also qualify the architectural claim to reflect that the evidence is strongest for the evaluated benchmarks while noting that further work on noisier, open-ended tasks remains valuable. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper reports empirical observations of power-law scaling (T ∝ D^γ with R² > 0.99) between RL training compute and reasoning depth in the ScaleLogic synthetic framework, with γ increasing monotonically as logical expressiveness grows from implication-only to full FOL. These relationships are presented as fitted results across multiple RL methods and curriculum variants rather than as predictions derived from a self-referential equation or ansatz that reduces to the input data by construction. The central claim—that long-horizon reasoning deficits are not fundamental to autoregressive transformers—rests on experimental transfer gains to math and reasoning benchmarks, without load-bearing self-citations, uniqueness theorems imported from prior author work, or renaming of known results as novel derivations. The framework's independent axes of depth and expressiveness further ensure the reported scaling and transfer findings remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that the synthetic logical tasks capture the essential structure of long-horizon reasoning difficulties and that observed scaling and transfer generalize beyond the testbed.

free parameters (1)
  • scaling exponent gamma
    Observed values ranging from 1.04 to 2.60 are reported as outcomes of fitting training compute to depth for each expressiveness level.
axioms (1)
  • domain assumption The synthetic logical reasoning tasks in ScaleLogic reflect core difficulties of real-world long-horizon reasoning.
    Invoked when generalizing power-law scaling and downstream benchmark gains to broader LLM reasoning capabilities.
invented entities (1)
  • ScaleLogic framework no independent evidence
    purpose: Provide independent control over reasoning horizon and logical expressiveness for controlled RL experiments.
    Newly introduced synthetic environment whose validity underpins all reported scaling and transfer results.

pith-pipeline@v0.9.0 · 5870 in / 1357 out tokens · 38525 ms · 2026-05-20T22:32:39.924890+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 35 internal anchors

  1. [1]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

  2. [2]

    arXiv preprint arXiv:2212.09689 , year=

    Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor , author=. arXiv preprint arXiv:2212.09689 , year=

  3. [3]

    Contextual Integrity in

    Guangchen Lan and Huseyin A Inan and Sahar Abdelnabi and Janardhan Kulkarni and Lukas Wutschitz and Reza Shokri and Christopher Brinton and Robert Sim , booktitle=. Contextual Integrity in

  4. [4]

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4 , author=. arXiv preprint arXiv:2306.02707 , year=

  5. [5]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author=. arXiv preprint arXiv:2309.12284 , year=

  6. [6]

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author=. arXiv preprint arXiv:2308.09583 , year=

  7. [7]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

  8. [8]

    arXiv preprint arXiv:2407.03502 , year=

    AgentInstruct: Toward Generative Teaching with Agentic Flows , author=. arXiv preprint arXiv:2407.03502 , year=

  9. [14]

    arXiv preprint arXiv:2305.13691 , year=

    Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering , author=. arXiv preprint arXiv:2305.13691 , year=

  10. [16]

    International Conference on Learning Representations (ICLR) , year=

    Learning from Synthetic Data Improves Multi-hop Reasoning , author=. International Conference on Learning Representations (ICLR) , year=

  11. [22]

    Large Language Models are Zero-Shot Reasoners

    Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=

  12. [23]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=

  13. [37]

    International Conference on Learning Representations , year=

    Transformers Struggle to Learn to Search , author=. International Conference on Learning Representations , year=

  14. [38]

    Advances in Neural Information Processing Systems , year=

    Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. Advances in Neural Information Processing Systems , year=

  15. [49]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  16. [50]

    2023 , note =

    AMC 12 Problems and Solutions , author =. 2023 , note =

  17. [51]

    2025 , note =

    AIME Problems and Solutions , author =. 2025 , note =

  18. [53]

    Proceedings of the 42nd International Conference on Machine Learning , year=

    GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? , author=. Proceedings of the 42nd International Conference on Machine Learning , year=

  19. [54]

    2025 , eprint=

    seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs , author=. 2025 , eprint=

  20. [55]

    International Conference on Machine Learning , year=

    ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning , author=. International Conference on Machine Learning , year=

  21. [56]

    Advances in Neural Information Processing Systems , year=

    Are Language Models Efficient Reasoners? A Perspective from Logic Programming , author=. Advances in Neural Information Processing Systems , year=

  22. [57]

    Tools and Algorithms for the Construction and Analysis of Systems , pages =

    de Moura, Leonardo and Bj. Tools and Algorithms for the Construction and Analysis of Systems , pages =. 2008 , publisher =

  23. [62]

    Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

    Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

  24. [63]

    The Pitfalls of Next-Token Prediction , booktitle =

    Gregor Bachmann and Vaishnavh Nagarajan , editor =. The Pitfalls of Next-Token Prediction , booktitle =. 2024 , url =

  25. [64]

    The pitfalls of next-token prediction

    Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Re...

  26. [65]

    Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles

    Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, and Mingxuan Wang. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025

  27. [66]

    In: Ramakrishnan, C.R., Rehof, J

    Leonardo de Moura and Nikolaj Bj rner. Z3 : An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337--340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24

  28. [67]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025 a

  29. [68]

    G1: Teaching llms to reason on graphs with reinforcement learning

    Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, and Yisen Wang. G1: Teaching llms to reason on graphs with reinforcement learning. arXiv preprint arXiv:2505.18499, 2025 b

  30. [69]

    Resyn: Autonomously scaling synthetic environments for reasoning models

    Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, and Huzefa Rangwala. Resyn: Autonomously scaling synthetic environments for reasoning models. arXiv preprint arXiv:2602.20117, 2026

  31. [70]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  32. [71]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  33. [72]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010...

  34. [73]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  35. [74]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  36. [75]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [76]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  38. [77]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  39. [78]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  40. [79]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025

  41. [80]

    Contextual integrity in LLM s via reasoning and reinforcement learning

    Guangchen Lan, Huseyin A Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher Brinton, and Robert Sim. Contextual integrity in LLM s via reasoning and reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  42. [81]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022

  43. [82]

    Zebralogic: On the scaling limits of llms for logical reasoning

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning. In International Conference on Machine Learning, 2025

  44. [83]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  45. [84]

    Saturn: Sat-based reinforcement learning to unleash llms reasoning

    Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, and Yihong Dong. Saturn: Sat-based reinforcement learning to unleash llms reasoning. arXiv preprint arXiv:2505.16368, 2025 a

  46. [85]

    Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

    Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641, 2025 b

  47. [86]

    R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

    Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, and Xunliang Cai. R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

  48. [87]

    Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024

  49. [88]

    h1: Bootstrapping llms to reason over longer horizons via reinforcement learning

    Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping llms to reason over longer horizons via reinforcement learning. arXiv preprint arXiv:2510.07312, 2025

  50. [89]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  51. [90]

    Are language models efficient reasoners? a perspective from logic programming

    Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, and Bernhard Sch \"o lkopf. Are language models efficient reasoners? a perspective from logic programming. In Advances in Neural Information Processing Systems, 2025

  52. [91]

    Reasoning models reason well, until they don't

    Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, and Abulhair Saparov. Reasoning models reason well, until they don't. arXiv preprint arXiv:2510.22371, 2025

  53. [92]

    seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025

    Mohammad Ramezanali, Mo Vazifeh, and Paolo Santi. seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025. URL https://arxiv.org/abs/2509.16866

  54. [93]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

  55. [94]

    Transformers struggle to learn to search

    Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=qFVVBzXxR2V

  56. [95]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  57. [96]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  58. [97]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  59. [98]

    Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas K \"o pf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760, 2025

  60. [99]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300, 2025

  61. [100]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  62. [101]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  63. [102]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2023

  64. [103]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024

  65. [104]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  66. [105]

    Enhance reasoning for large language models in the game werewolf

    Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game werewolf. arXiv preprint arXiv:2402.02330, 2024

  67. [106]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  68. [107]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  69. [108]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  70. [109]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  71. [110]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  72. [111]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  73. [112]

    Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. Gsm-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity? In Proceedings of the 42nd International Conference on Machine Learning, 2025