pith. machine review for the scientific record.

arxiv: 2605.08905 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:36 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLMs · reinforcement learning · NP-hard optimization · quality-aware rewards · OPT-BENCH · success rate · quality ratio · generalization

The pith

Quality-aware reinforcement learning on OPT-BENCH enables small LLMs to find high-quality solutions to NP-hard optimization problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes OPT-BENCH as a framework for training and evaluating LLMs on NP-hard optimization tasks by extending RLVR with quality signals. It provides instance generators, quality verifiers, and continuous rewards across ten tasks so models can improve solution quality beyond mere feasibility. A 7B model trained on 15K examples reaches 93.1 percent success rate and 46.6 percent quality ratio, beating GPT-4o. The same training produces gains on mathematics, logic, knowledge, and instruction-following tasks. Quality-aware rewards drive a 28.8 percent improvement over binary feedback, and task diversity matters more than raw data volume for generalization.

Core claim

OPT-BENCH supplies scalable training infrastructure, a 1,000-instance benchmark measuring both Success Rate and Quality Ratio, and quality-aware rewards that replace binary correctness signals; training Qwen2.5-7B-Instruct-1M on 15K examples produces 93.1 percent SR and 46.6 percent QR while outperforming GPT-4o and transferring to unrelated reasoning domains.
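The core claim rests on two benchmark metrics that this page names but does not define. A minimal sketch of one plausible reading, assuming Success Rate counts feasible solutions and Quality Ratio averages a per-instance quality score in [0, 1]; the paper's exact definitions may differ.

```python
# A hedged reading of the two benchmark metrics, assuming Success Rate is the
# fraction of instances with a feasible solution and Quality Ratio averages a
# per-instance quality score in [0, 1] (0 for infeasible outputs). The paper's
# exact definitions may differ.
def success_rate(results):
    """results: list of (feasible: bool, quality: float) pairs, one per instance."""
    return sum(feasible for feasible, _ in results) / len(results)

def quality_ratio(results):
    return sum(quality if feasible else 0.0 for feasible, quality in results) / len(results)

# Illustrative numbers only: 3 of 4 instances feasible, with varying quality.
results = [(True, 0.9), (True, 0.5), (False, 0.0), (True, 0.4)]
print(success_rate(results))   # 0.75
print(quality_ratio(results))  # 0.45
```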

What carries the argument

Quality-aware rewards that assign continuous scores reflecting solution optimality instead of binary feasibility checks.

Load-bearing premise

Quality verifiers can accurately score how close any solution is to the true optimum even when the optimum itself cannot be computed for the test instances.

What would settle it

Compute exact optimal solutions for a fresh set of NP-hard instances and measure whether the quality ratios produced by models trained on OPT-BENCH match the 46.6 percent benchmark figure.
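A minimal sketch of that check on one task family, assuming the task is symmetric TSP (a minimization problem), that the fresh instances are small enough for exhaustive search, and that the per-instance quality ratio for minimization is the optimal cost divided by the candidate cost. None of these specifics come from the paper.

```python
# Brute-force the exact optimum on tiny TSP instances and compute a candidate
# tour's true quality ratio. Assumptions: symmetric distance matrix, minimization,
# quality ratio = optimal_cost / candidate_cost.
from itertools import permutations

def tour_cost(tour, dist):
    """Total cost of a closed tour given a symmetric distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def exact_optimum(dist):
    """Brute-force optimal tour cost; tractable only for roughly 10 cities or fewer."""
    n = len(dist)
    return min(tour_cost((0, *perm), dist) for perm in permutations(range(1, n)))

def true_quality_ratio(candidate_tour, dist):
    """Quality ratio against the exact optimum for a minimization task (1.0 = optimal)."""
    return exact_optimum(dist) / tour_cost(candidate_tour, dist)

# Tiny illustrative instance (distances are made up); this candidate tour is optimal.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
print(true_quality_ratio((0, 1, 3, 2), dist))  # prints 1.0
```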

Figures

Figures reproduced from arXiv: 2605.08905 by Haodong Duan, Kai Chen, Linyang Li, Qingwen Liu, Shengyuan Ding, Xiaozhe Li, Xinyu Fang, Yang Li.

Figure 1
Figure 1: Overview of the FORGE-ENGINE. The FORGE-BENCH encompasses 10 NP-hard optimization tasks across five categories (e.g., subset selection, path planning), designed to assess reasoning capabilities. An automated pipeline consisting of a Data Generator, Solution Validator, and Heuristic Solver ensures controllable data synthesis, rigorous evaluation, and scalable training. A case study on the Hamiltonian Circui… view at source ↗
Figure 2
Figure 2: FORGE-RLVR training pipeline with quality-aware RLVR. The model generates solutions with step-by-step reasoning, which are evaluated through three components: (i) format verification checking output structure, (ii) feasibility verification ensuring constraint satisfaction, and (iii) quality assessment measuring optimality relative to heuristic baselines. The combined reward signal guides model optimization… view at source ↗
Figure 3
Figure 3: Comparison of RL training strategies during multi-task training, with performance evaluated on both… view at source ↗
Figure 4
Figure 4: Comparison between single linear curriculum learning and curriculum replay strategy under GRPO. view at source ↗
read the original abstract

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
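Figure 2's caption describes the reward as three stacked checks: format verification, feasibility verification, and quality assessment relative to heuristic baselines. The sketch below is one plausible composition of such a reward; the penalty values, the baseline-relative quality formula, and all argument names are assumptions, not the paper's implementation.

```python
# A hedged sketch of a quality-aware reward in the spirit of the three components
# Figure 2 describes. The penalties and the quality formula are assumptions.
def quality_aware_reward(output, parse_fn, is_feasible, cost_fn, heuristic_cost):
    """Score one model output for a minimization task.

    parse_fn       -- extracts a structured solution from raw text, or returns None
    is_feasible    -- checks all hard constraints on the parsed solution
    cost_fn        -- objective value of a feasible solution (lower is better)
    heuristic_cost -- cost of a heuristic baseline solution for the same instance
    """
    solution = parse_fn(output)
    if solution is None:             # (i) format verification failed
        return -1.0
    if not is_feasible(solution):    # (ii) feasibility verification failed
        return 0.0
    # (iii) quality assessment: 1.0 for matching the heuristic baseline, above 1.0
    # for beating it, decaying toward 0 for poor but feasible solutions.
    return heuristic_cost / max(cost_fn(solution), 1e-9)
```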

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces OPT-BENCH, a framework for training and evaluating LLMs on NP-hard optimization problems via quality-aware Reinforcement Learning with Verifiable Rewards (RLVR). It comprises instance generators, quality verifiers, and optimal baselines for 10 tasks; a 1,000-instance benchmark assessing feasibility via Success Rate (SR) and optimality via Quality Ratio (QR); and quality-aware rewards that support continuous improvement beyond binary correctness. Training Qwen2.5-7B-Instruct-1M on 15K examples yields 93.1% SR and 46.6% QR, outperforming GPT-4o (29.6% SR, 14.6% QR). Quality-aware rewards improve solutions by 28.8% over binary rewards, with positive transfer to mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Task diversity is shown to drive generalization more than data quantity.

Significance. If the quality verifiers are shown to track true optimality, the work would be significant for extending RLVR to optimization quality on NP-hard problems rather than binary correctness alone. The release of generators, verifiers, and baselines for 10 tasks provides a reusable infrastructure that could accelerate research in this area. The transfer gains and analysis of diversity versus quantity offer concrete insights for scaling RLVR on complex reasoning tasks. The benchmark itself is a clear contribution even if the numerical claims require further substantiation.

major comments (2)
  1. [Abstract and §3 (Methods)] The central numerical claims (93.1% SR / 46.6% QR, 28.8% gain from quality-aware rewards, and superiority over GPT-4o) rest on the Quality Ratio metric. For NP-hard problems where exact optima are unavailable on the 1,000-instance test set, the manuscript must demonstrate that the quality verifiers correlate with true optimality. Validation against known optima on smaller, solvable instances (via exhaustive search or exact solvers) is required; without it, QR and the reported gains risk reflecting verifier heuristics rather than genuine optimization improvement. This is load-bearing for the primary results.
  2. [§4 (Results) and Table 1 (presumed)] No error bars, standard deviations, or statistical tests accompany the reported SR and QR values or the 28.8% improvement. Given the stochasticity of LLM sampling and RL training, multiple independent runs are needed to establish that the outperformance over GPT-4o and the reward-type ablation are reliable rather than run-specific.
minor comments (3)
  1. [Title and Abstract] The title refers to 'Forge' while the abstract and body introduce 'OPT-BENCH'; clarify whether Forge is the RL method, the full system, or a synonym, and ensure consistent nomenclature throughout.
  2. [§4 (Results)] Missing details on baseline construction: how GPT-4o and other models were prompted or sampled for the benchmark (e.g., temperature, few-shot examples, decoding strategy) should be specified to enable reproduction.
  3. [§5 (Transfer Experiments)] The transfer results (+2.2% math, etc.) are reported without specifying the evaluation benchmarks or whether the same quality-aware training was used; add precise cross-task evaluation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the manuscript. We provide point-by-point responses to the major comments and indicate the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and §3 (Methods)] The central numerical claims (93.1% SR / 46.6% QR, 28.8% gain from quality-aware rewards, and superiority over GPT-4o) rest on the Quality Ratio metric. For NP-hard problems where exact optima are unavailable on the 1,000-instance test set, the manuscript must demonstrate that the quality verifiers correlate with true optimality. Validation against known optima on smaller, solvable instances (via exhaustive search or exact solvers) is required; without it, QR and the reported gains risk reflecting verifier heuristics rather than genuine optimization improvement. This is load-bearing for the primary results.

    Authors: We agree that validating the quality verifiers' correlation with true optimality is essential to substantiate the Quality Ratio metric and the reported improvements. The original manuscript describes the quality verifiers and optimal baselines but does not include explicit validation experiments on small instances. In the revised manuscript, we will add a validation study in Section 3, where we generate small instances solvable by exact methods (e.g., dynamic programming or solvers for subsets of tasks), compute true optima, and report correlation coefficients (such as Spearman rank correlation) between verifier scores and true quality. This will confirm that QR reflects genuine optimization progress. revision: yes

  2. Referee: [§4 (Results) and Table 1 (presumed)] No error bars, standard deviations, or statistical tests accompany the reported SR and QR values or the 28.8% improvement. Given the stochasticity of LLM sampling and RL training, multiple independent runs are needed to establish that the outperformance over GPT-4o and the reward-type ablation are reliable rather than run-specific.

    Authors: We recognize the need for statistical rigor in reporting results from stochastic processes like LLM sampling and RL training. The presented results are from single training runs, which limits the assessment of variability. To address this, we will conduct multiple independent runs with different random seeds for the main experiments in the revised version. We will update Table 1 and the results section to include mean values with standard deviations or error bars, and perform statistical tests (e.g., t-tests) to assess the significance of the differences over GPT-4o and between reward types. revision: yes
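The first response above commits to correlating verifier scores with true optima on small, exactly solvable instances. A minimal sketch of that analysis, with `quality_verifier` and `true_quality` as hypothetical stand-ins for the benchmark's own components, which this page does not expose:

```python
# Rank-correlate verifier scores with exact quality on solvable instances.
from scipy.stats import spearmanr

def validate_verifier(instances, solutions, quality_verifier, true_quality):
    """Spearman correlation between verifier scores and true quality."""
    verifier_scores = [quality_verifier(inst, sol) for inst, sol in zip(instances, solutions)]
    exact_scores = [true_quality(inst, sol) for inst, sol in zip(instances, solutions)]
    rho, p_value = spearmanr(verifier_scores, exact_scores)
    return rho, p_value
```

The second response commits to multi-seed reporting with statistical tests. A hedged sketch of that aggregation using a Welch t-test; the per-seed Quality Ratios below are illustrative placeholders, not results from the paper:

```python
# Mean ± standard deviation over independent seeds plus a Welch t-test between
# reward types. All numbers are hypothetical.
import statistics
from scipy.stats import ttest_ind

quality_aware_qr = [0.462, 0.471, 0.458, 0.469]  # hypothetical per-seed QR values
binary_reward_qr = [0.355, 0.349, 0.362, 0.358]  # hypothetical per-seed QR values

t_stat, p_value = ttest_ind(quality_aware_qr, binary_reward_qr, equal_var=False)
print(f"quality-aware QR: {statistics.mean(quality_aware_qr):.3f} ± {statistics.stdev(quality_aware_qr):.3f}")
print(f"binary-reward QR: {statistics.mean(binary_reward_qr):.3f} ± {statistics.stdev(binary_reward_qr):.3f}")
print(f"Welch t-test: t = {t_stat:.2f}, p = {p_value:.4g}")
```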

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks and verifiers rather than self-referential definitions or fitted predictions

full rationale

The paper presents an empirical RLVR training pipeline on OPT-BENCH, reporting measured Success Rate and Quality Ratio on a held-out 1,000-instance benchmark after training on 15K generated examples. No equations, derivations, or first-principles claims appear in the provided text that reduce the reported performance numbers (93.1% SR, 46.6% QR, 28.8% improvement) to quantities defined from the same fitted outputs or self-citations. The quality verifiers and optimal baselines are described as external components of the benchmark infrastructure; the superiority claims are statistical comparisons against GPT-4o on fixed test instances, not algebraic identities or renamings of inputs. This is the normal case of a self-contained empirical study whose central numbers are falsifiable against the benchmark rather than forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework extends standard RLVR with quality scoring but does not introduce new mathematical objects or fitted constants beyond the usual RL hyperparameters.

pith-pipeline@v0.9.0 · 5594 in / 1340 out tokens · 37927 ms · 2026-05-12T02:36:01.520363+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
