The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3
The pith
Storing past rollout features and clustering recurring errors lets dynamic penalties raise diversity and accuracy in language-model reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By storing intermediate model representations from previous rollouts and applying density-based clustering to detect frequently recurring error patterns, MEDS dynamically shapes rewards to penalize prevalent mistakes more heavily. This encourages broader exploration, reduces repeated erroneous behaviors, and yields higher average performance than standard baselines.
What carries the argument
MEDS (Memory-Enhanced Dynamic reward Shaping), which stores historical intermediate representations, clusters them to identify recurrent error patterns, and adjusts per-rollout rewards accordingly.
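The review gives no pseudocode, so here is a minimal sketch of the loop as described: store rollout features, cluster them by density, and subtract a penalty proportional to cluster prevalence. The tiny DBSCAN-style clusterer, the `penalty_scale` knob, and all function names are assumptions for illustration, not the paper's actual implementation.

```python
from collections import Counter

def dbscan_like(features, eps=0.5, min_samples=3):
    """Small stand-in for density-based clustering (DBSCAN-style):
    points with >= min_samples neighbours within eps share a cluster id;
    everything else is labelled -1 (noise)."""
    n = len(features)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbours = [[j for j in range(n) if dist(features[i], features[j]) <= eps]
                  for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_samples:
            continue
        stack = [i]                      # flood-fill the density-connected component
        while stack:
            p = stack.pop()
            if labels[p] == -1:
                labels[p] = cluster
                if len(neighbours[p]) >= min_samples:
                    stack.extend(neighbours[p])
        cluster += 1
    return labels

def shaped_rewards(features, base_rewards, penalty_scale=0.2, eps=0.5, min_samples=3):
    """Hypothetical MEDS-style shaping: rollouts in denser (more prevalent)
    clusters receive a larger penalty; noise points keep their base reward."""
    labels = dbscan_like(features, eps, min_samples)
    counts = Counter(l for l in labels if l != -1)
    n = len(features)
    return [r - penalty_scale * (counts[l] / n) if l != -1 else r
            for r, l in zip(base_rewards, labels)]
```

In the real method the features would be stored intermediate model representations accumulated across training, not a single batch, but the density-proportional penalty has the same shape.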
If this is right
- Across five datasets and three base models, MEDS raises pass@1 by up to 4.13 points and pass@128 by up to 4.37 points over existing methods.
- Behavioral diversity rises during sampling, confirmed by both LLM annotations and quantitative metrics.
- Rollouts matching common error clusters receive heavier penalties, which the method claims directly reduces looping on the same failures.
- The approach targets a failure mode that standard entropy regularization does not address explicitly.
Where Pith is reading between the lines
- The same memory-and-cluster idea could be tested in other sequential decision settings where policies repeat suboptimal actions.
- If the stored representations capture task-relevant features, the method might reduce reliance on hand-crafted reward terms in future RL setups.
- Extending the memory window or trying different clustering thresholds could be checked to see whether longer history improves or harms results.
Load-bearing premise
Density-based clustering on stored intermediate representations will correctly group and flag detrimental recurrent error patterns so that extra penalties on them produce useful exploration instead of suppressing valid answer variations.
What would settle it
Running the same training loops without the clustering step or with randomly assigned penalties and finding no drop in diversity metrics or performance would show that the targeted identification of error patterns is not what drives the gains.
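The random-penalty control described above is simple to implement: apply penalties of comparable magnitude to a random subset of rollouts, independent of any clustering. A minimal sketch (the penalty magnitude, fraction, and function name are illustrative, not from the paper):

```python
import random

def random_penalty_rewards(base_rewards, penalty_scale=0.2, frac=0.5, seed=0):
    """Control condition: penalize a random subset of rollouts,
    ignoring clustering entirely. If training with this matches MEDS,
    the targeted error identification is not what drives the gains."""
    rng = random.Random(seed)
    n = len(base_rewards)
    hit = set(rng.sample(range(n), int(frac * n)))
    return [r - penalty_scale if i in hit else r
            for i, r in enumerate(base_rewards)]
```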
read the original abstract
Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MEDS, a Memory-Enhanced Dynamic reward Shaping method for RL in LLMs. It stores intermediate representations from past rollouts, applies density-based clustering to detect frequently recurring patterns (interpreted as errors), and assigns heavier penalties to rollouts in denser clusters. This is intended to reduce repetitive mistakes and increase exploration beyond standard entropy regularization. Experiments across five datasets and three base models report consistent gains (up to 4.13 pass@1 and 4.37 pass@128 points) plus improved diversity metrics from LLM annotations and quantitative measures.
Significance. If the core mechanism reliably penalizes detrimental recurrent errors rather than common valid behaviors, MEDS would offer a practical extension to reward shaping that directly targets historical failure modes in LLM sampling. The multi-dataset, multi-model evaluation and dual diversity analyses (qualitative and quantitative) provide a reasonable basis for claiming broader applicability, though verification of the error-identification assumption is required for the result to be load-bearing.
major comments (2)
- [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.
- [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.
minor comments (2)
- [Experiments] The abstract and results mention 'LLM-based annotations' for diversity but provide no details on the annotation prompt, model used, or inter-annotator agreement; this should be clarified for reproducibility.
- [Method] Notation for the penalty scaling factor and clustering hyperparameters is introduced without explicit equations or pseudocode; adding a short algorithm box would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, acknowledging where the manuscript can be strengthened through revisions and providing clarifications on the methodological assumptions and experimental reporting.
read point-by-point responses
Referee: [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.
Authors: We agree that the method relies on the assumption that dense clusters in the stored representations primarily capture recurrent error patterns rather than common correct behaviors. This interpretation is motivated by the nature of the tasks (e.g., code generation), where repeated failures often manifest as similar intermediate representations, while successful solutions tend to be more diverse. However, we acknowledge that this assumption was not explicitly verified in the original submission. In the revision, we will add a new analysis subsection that evaluates cluster purity by computing the average success rate (using ground-truth labels) for rollouts assigned to each cluster. We will also report the proportion of successful rollouts excluded or down-weighted and discuss cases where clusters contain mixed outcomes. This will provide empirical grounding for the error-identification claim and allow readers to assess the assumption directly.
Revision: yes
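The promised cluster-purity analysis reduces to a per-cluster success-rate computation. A minimal sketch, assuming each rollout carries a cluster label (-1 for noise) and a ground-truth pass/fail flag; names are hypothetical:

```python
def cluster_purity(labels, successes):
    """Per-cluster success rate. If dense clusters truly capture recurrent
    errors, their success rates should be low; a high rate flags a cluster
    of valid behaviors that the penalty would wrongly suppress."""
    totals, wins = {}, {}
    for label, success in zip(labels, successes):
        if label == -1:                      # noise points carry no penalty
            continue
        totals[label] = totals.get(label, 0) + 1
        wins[label] = wins.get(label, 0) + int(success)
    return {label: wins[label] / totals[label] for label in totals}
```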
Referee: [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.
Authors: We accept that the experimental section requires additional rigor for reproducibility and to isolate the contribution of MEDS. In the revised version, we will include: (1) statistical significance tests (paired t-tests across 5 random seeds) for all reported pass@k improvements; (2) complete hyperparameter tables for baselines and MEDS, including learning rates, entropy coefficients, memory buffer sizes, and clustering parameters (eps and min_samples for DBSCAN); (3) implementation details for the three base models, specifying exact model checkpoints, LoRA configurations, and training hardware; and (4) a new subsection with compute/memory measurements showing that the overhead of representation storage and clustering is under 5% of total training time, plus an ablation that disables the density-based penalty while retaining the memory buffer to control for incidental regularization. These additions will strengthen attribution of the gains to the dynamic shaping mechanism.
Revision: yes
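The paired t-test across seeds promised in (1) amounts to a t statistic on per-seed score differences. A self-contained stdlib sketch (look up the p-value against a t table or `scipy.stats.t` where available):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic for per-seed scores of two methods.
    Returns (t, degrees of freedom); xs and ys must be aligned by seed."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)          # sample std dev (ddof=1)
    return mean / (sd / math.sqrt(n)), n - 1
```

With 5 seeds the test has 4 degrees of freedom, so the per-seed gains need to be quite consistent for significance at conventional thresholds.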
Circularity Check
No circularity; empirical framework with independent clustering step
full rationale
The paper describes MEDS as storing intermediate representations from rollouts, applying density-based clustering to identify recurring patterns, and penalizing denser clusters to encourage exploration. No equations, derivations, or self-citations reduce the claimed performance gains (e.g., pass@1 improvements) to a quantity defined in terms of itself or fitted directly to the target metric. The central mechanism relies on a clustering procedure applied to stored features rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain, and the experimental results across datasets and models are validated against external benchmarks rather than internally defined quantities.
Axiom & Free-Parameter Ledger
free parameters (2)
- clustering hyperparameters
- penalty scaling factor
axioms (2)
- domain assumption: Intermediate model representations encode distinguishable features of behavioral error patterns across rollouts.
- domain assumption: Penalizing rollouts in high-density error clusters promotes broader exploration without harming overall learning.