{"total":31,"items":[{"citing_arxiv_id":"2605.19723","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges","primary_cat":"cs.CL","submitted_at":"2026-05-19T11:56:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19425","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17672","ref_index":47,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-05-17T22:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16727","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play","primary_cat":"cs.AI","submitted_at":"2026-05-16T00:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15464","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero","primary_cat":"cs.LG","submitted_at":"2026-05-14T23:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15224","ref_index":38,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ICRL: Learning to Internalize Self-Critique with Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-13T08:50:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage 
estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12058","ref_index":11,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Holder Policy Optimisation","primary_cat":"cs.LG","submitted_at":"2026-05-12T12:45:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"For instance, Group Relative Policy Optimisation (GRPO), Geometric-Mean Policy Optimisation (GMPO) [Zhao et al., 2025] and Dynamic Sampling Policy Optimisation (DAPO) [Yu et al., 2025]. With token-level clipping, the objective function (7) of our Hölder-MPO becomes $J_{Ht}(\\theta) = \\mathbb{E}_{x \\sim \\mathcal{D},\\, \\{y_i\\}_{i=1}^{G} \\sim \\pi_{\\theta_{\\mathrm{old}}}(\\cdot \\mid x)} \\left[ \\frac{1}{G} \\left( \\sum_{\\hat{A}_i > 0} C_{i,p} \\hat{A}_i + \\sum_{\\hat{A}_i < 0} D_{i,p} \\hat{A}_i \\right) \\right]$, (11) $C_{i,p} = \\left( \\frac{1}{|y_i|} \\sum_{t=1}^{|y_i|} \\min\\left( r_{i,t}(\\theta), \\operatorname{clip}(r_{i,t}(\\theta), 1-\\varepsilon, 1+\\varepsilon) \\right)^{p} \\right)^{1/p}$, $D_{i,p} = \\left( \\frac{1}{|y_i|} \\sum_{t=1}^{|y_i|} \\max\\left( r_{i,t}(\\theta), \\operatorname{clip}(r_{i,t}(\\theta), 1-\\varepsilon, 1+\\varepsilon) \\right)^{p} \\right)^{1/p}$, where the clipping function is defined by $\\operatorname{clip}(x, 1-\\epsilon, 1+\\epsilon) := \\begin{cases} 1-\\epsilon, & \\text{if } x < 1-\\epsilon \\\\ x, & \\text{if } 1-\\epsilon \\le x \\le 1+\\epsilon \\\\ 1+\\epsilon, & \\text{if } x > 1+\\epsilon. \\end{cases}$ 
(12) To deduce this formula, we firstly recall the token-level clipping GRPO objective function in [Shao"},{"citing_arxiv_id":"2605.11666","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-12T07:26:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoTD applies crossover for skill composition and parametric mutation for complexity scaling, filtered by a Zone of Proximal Development, to generate tasks that improve LLM reasoning generalization across models.","context_count":1,"top_context_role":"method","top_context_polarity":"unclear","context_text":"from collections import deque\n\ndef f(arr: list[int], k: int, m: int):\n    n = len(arr)\n    if k <= 0 or k > n:\n        return -1\n    sums = []\n    window_sum = sum(arr[:k])\n    sums.append(window_sum)\n    for i in range(k, n):\n        window_sum += arr[i] - arr[i - k]\n        sums.append(window_sum)\n    # === MUTATION START: Max-Window Phase ===\n    m1 = len(sums)\n    if m <= 0 or m > m1:\n        return -1\n    maxes = []\n    dq = deque()\n    for i in range(m1):\n        while dq and sums[dq[-1]] <= sums[i]:\n            dq.pop()\n        dq.append(i)\n        if dq[0] <= i - m:\n            dq.popleft()\n        if i >= m - 1:\n            maxes.append(sums[dq[0]])\n    # === MUTATION END ===\n    L = min(len(sums), len(maxes))\n    diffs = [maxes[i] - sums[i] for i in range(L)]\n    min_diff = diffs[0]\n    idx = 0\n    for i in range(1, L):\n        if diffs[i] < min_diff:\n            min_diff = diffs[i]\n            idx = i\n    return idx"},{"citing_arxiv_id":"2605.11505","ref_index":42,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Selective Off-Policy Reference Tuning with Plan Guidance","primary_cat":"cs.AI","submitted_at":"2026-05-12T04:25:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SORT turns all-wrong prompts into selective learning signals by 
weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09292","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:38:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Annual Meeting of the Association for Computational Linguistics, pages 158-167, 2017. doi: 10.18653/v1/P17-1015. [11] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024. [12] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. 
In Proceedings of the 62nd Annual Meeting of the Association"},{"citing_arxiv_id":"2605.07501","ref_index":14,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression","primary_cat":"cs.LG","submitted_at":"2026-05-08T09:37:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpThink reduces average CoT response length by up to 77% while improving accuracy on math benchmarks via experience-guided reward shaping and difficulty-adaptive advantage in RL.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Training is conducted on the DeepScaleR dataset [31], which contains approximately 40K mathematical problems with calibrated difficulty levels. We set α = 0.1 and r_pen = 0.5 as default; full hyperparameters are provided in Appendix B. Evaluation benchmarks. We evaluate on five standard mathematical reasoning benchmarks: AIME24 [6], AMC23 [33], MATH-500 [26], Minerva Math [20], and OlympiadBench [14]. To further assess out-of-domain generalization, we additionally evaluate on LiveCodeBench [19] for code reasoning, GPQA-Diamond [38] for graduate-level scientific question answering, and MMLU [15] for broad multitask language understanding. Detailed descriptions of all benchmarks are provided in Appendix C. 
Evaluation protocol. We report Pass@1 accuracy, average response token length, and Intelligence"},{"citing_arxiv_id":"2605.02028","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Language models fail at extended rule following","primary_cat":"cs.CL","submitted_at":"2026-05-03T19:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"id=WE_vluYUL-X. 10. T. Schick, et al., Toolformer: Language Models Can Teach Themselves to Use Tools (2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html. 11. R. Lupoiu, et al., A multi-agentic framework for real-time, autonomous freeform metasurface design. Science Advances 11(44), eadx8006 (2025), doi:10.1126/sciadv.adx8006, https://www.science.org/doi/abs/10.1126/sciadv.adx8006. 12. S. Chen, et al., Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation (2025), doi:10.18653/v1/2025.emnlp-main.511, https://aclanthology.org/2025.emnlp-main.511/. 13. C. Snell, J. Lee, K. 
Xu, A."},{"citing_arxiv_id":"2605.00674","ref_index":21,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-01T13:56:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MathArena is broadened into a maintained platform with new benchmarks for proofs, research questions, and formal verification, where GPT-5.5 scores 98% on 2026 USAMO and 74% on research-level tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23333","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Process Supervision of Confidence Margin for Calibrated LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-25T14:40:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21510","ref_index":126,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving","primary_cat":"cs.CL","submitted_at":"2026-04-23T10:12:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20183","ref_index":110,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving","primary_cat":"cs.CL","submitted_at":"2026-04-22T04:55:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"from linear constraint expressions or the objective function. 2. Verifying bounds and types to prevent infeasibility …… Modeling Cluster Example Coding Cluster Example Pitfall: 1. Do not proceed without validating solver initialization …… Figure 3: Two examples in our Dual-Cluster Memory. N = 5), we trigger a knowledge update step: $K^{(t+1)} = \\mathrm{LLM}_{\\mathrm{synth}}\\left( K^{(t)} \\cup \\bigcup_{j=1}^{N} \\Phi_{n_j} \\right)$ (2) Here, $\\mathrm{LLM}_{\\mathrm{synth}}$ is used to abstract generalized patterns from the new batch of instance knowledge $\\Phi_{n_j}$ and merge them into the generalized knowledge $K^{(t)}$. This ensures that $K$ evolves to capture robust, non-redundant insights while retaining specific pitfall warnings, and is not overly influenced by extreme samples, as shown in Figure 3. Bipartite Graph Construction. At the same time, we introduce a bipartite graph $G$ to model the associations between these decoupled clusters. Since each experience node $n$ naturally maps to a pair of clusters $(C^{M}_{i}, C^{C}_{j})$, these linkages aggregate into a global structure. We formalize this as a bipartite graph $G = (V^{M}, V^{C}, E)$, where the edge weight $w_{ij}$ quantifies the co-occurrence frequency of modeling logic $C^{M}_{i}$ and coding strategy $C^{C}_{j}$. The strong edges represent proven pathways, providing critical priors for subsequent usage. 
3.3 Memory-Augmented Inference 3.3.1 Dual-Retrieval For a new problem $x_{\\mathrm{new}}$, DCM-Agent leverages the memory to efficiently navigate the solution space by retrieving relevant historical experiences. We first encode the problem into the modeling logic embedding $e_{\\mathrm{new}}$ and employ two complementary retrieval mechanisms to balance the problem relevance with general algorithmic applicability: Instance-Level Retrieval captures the granular problem similarity by retrieving specific nodes $H$ closest to $e_{\\mathrm{new}}$, thereby identifying relevant experience nodes that share detailed semantic features: $H = \\operatorname*{arg\\,max}_{K} \\{ \\mathrm{sim}(e_{\\mathrm{new}}, e_i) \\mid x_i \\in \\mathcal{D} \\}$. (3) Cluster-Level Retrieval targets abstract patterns by comparing $e_{\\mathrm{new}}$ directly with cluster ce"},{"citing_arxiv_id":"2604.18530","ref_index":45,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:26:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17957","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards","primary_cat":"cs.CL","submitted_at":"2026-04-20T08:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17304","ref_index":56,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Efficient Test-Time Scaling via Temporal Reasoning Aggregation","primary_cat":"cs.AI","submitted_at":"2026-04-19T07:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17293","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond \"I Don't Know\": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty","primary_cat":"cs.CL","submitted_at":"2026-04-19T07:15:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11297","ref_index":48,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping","primary_cat":"cs.LG","submitted_at":"2026-04-13T10:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error 
clusters identified via memory-stored representations and density clustering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06465","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Multi-objective Evolutionary Merging Enables Efficient Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-04-07T21:07:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evo-L2S uses multi-objective evolutionary model merging to produce reasoning models that cut generated chain-of-thought length by over 50% while preserving or improving accuracy on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03893","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-04T23:18:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Computer Physics Communications 140, 3 (2001), 418-431. arXiv:hep-ph/0012260 doi:10.1016/S0010-4655(01)00290-9 [20] Koji Hashimoto, Yuji Hirono, Jun Maeda, and Jojiro Totsuka-Yoshinaka. 2024. Neural network representation of quantum systems. Machine Learning: Science and Technology 5, 4 (2024), 045039. arXiv:2403.11420 [hep-th] doi:10.1088/2632-2153/ad81ac [21] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. 
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics"},{"citing_arxiv_id":"2512.07461","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-12-08T11:39:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.05591","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-12-05T10:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13755","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration","primary_cat":"cs.LG","submitted_at":"2025-08-19T11:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as 
complementary exploration dimensions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15778","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR","primary_cat":"cs.CL","submitted_at":"2025-07-21T16:34:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.01679","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling","primary_cat":"cs.LG","submitted_at":"2025-07-02T13:04:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15134","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-21T05:39:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL 
methods on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":243,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"SWEbench [327], CodeContests [427], and LiveCodeBench [309] evaluating LLM coding and problem-solving skills. Notable additions such as MHPP [148], ProBench [934], HumanEval Pro, MBPP Pro [993], and EquiBench [833] enhance the scope and complexity of coding challenges. Moreover, some studies have explored applying these benchmarks in real-world code development scenarios for automatic code generation and evaluation [243, 744]. • Commonsense Puzzle: Commonsense puzzle benchmarks, including LiveBench [850], BIG-Bench Hard [705] and ZebraLogic [450], assess models' ability to reason about commonsense situations. ARC [131] and DRE-Bench [947] are often viewed as challenging commonsense-based AGI tests. 
JustLogic [87] further contributes to the evaluation of deductive reasoning and"},{"citing_arxiv_id":"2502.01456","ref_index":139,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Process Reinforcement through Implicit Rewards","primary_cat":"cs.LG","submitted_at":"2025-02-03T15:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}