pith. machine review for the scientific record.

arxiv: 2604.11297 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · large language models · reward shaping · memory · sampling diversity · error patterns · dynamic rewards

The pith

Storing past rollout features and clustering recurring errors lets dynamic penalties raise diversity and accuracy in language-model reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MEDS to fix a common problem in reinforcement learning for large language models: policies that keep producing the same mistakes across many attempts. It stores intermediate representations from earlier rollouts, runs density-based clustering on those features to find which error patterns appear most often, and then applies stronger penalties to rollouts that match the popular error clusters. This memory-based adjustment supplements ordinary entropy terms by explicitly discouraging repeated failures rather than just adding generic randomness. A reader would care because higher diversity in sampling can translate into better final performance on tasks that require exploring many possible answers. The reported results show consistent gains across multiple datasets and base models when this shaping is used.
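
To make the mechanism concrete, here is a minimal sketch of the store-cluster-penalize loop, assuming DBSCAN as the density-based clusterer and a penalty proportional to cluster prevalence; the names, the nearest-centroid assignment, and the exact penalty form are illustrative guesses rather than the paper's implementation.

```python
# Illustrative sketch of memory-based reward shaping (not the authors' code).
import numpy as np
from sklearn.cluster import DBSCAN

class RolloutMemory:
    """Stores pooled intermediate representations from past rollouts."""
    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self.features = []

    def add(self, feats):
        self.features.extend(np.asarray(feats))
        self.features = self.features[-self.max_size:]

    def as_array(self):
        return np.asarray(self.features)

def shaped_rewards(base_rewards, rollout_feats, memory,
                   eps=0.5, min_samples=5, penalty_scale=0.1):
    """Subtract a prevalence-weighted penalty from rollouts whose features
    fall near a dense cluster of the stored history."""
    history = memory.as_array()
    shaped = np.asarray(base_rewards, dtype=float).copy()
    if len(history) < min_samples:
        return shaped
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(history)
    clusters = [c for c in set(labels) if c != -1]  # -1 marks noise points
    if not clusters:
        return shaped
    centroids = {c: history[labels == c].mean(axis=0) for c in clusters}
    sizes = {c: int(np.sum(labels == c)) for c in clusters}
    for i, f in enumerate(np.asarray(rollout_feats)):
        dists = {c: np.linalg.norm(f - mu) for c, mu in centroids.items()}
        nearest = min(dists, key=dists.get)
        if dists[nearest] < eps:  # rollout matches a recurring pattern
            shaped[i] -= penalty_scale * sizes[nearest] / len(history)
    return shaped
```

In this reading, the penalty grows with how often a pattern has recurred, which is the sense in which the shaping is dynamic: the same behavior is punished more heavily once its cluster has become prevalent.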

Core claim

By storing intermediate model representations from previous rollouts and applying density-based clustering to detect frequently recurring error patterns, MEDS dynamically shapes rewards to penalize prevalent mistakes more heavily. This encourages broader exploration, reduces repeated erroneous behaviors, and yields higher average performance than standard baselines.

What carries the argument

MEDS (Memory-Enhanced Dynamic reward Shaping), which stores historical intermediate representations, clusters them to identify recurrent error patterns, and adjusts per-rollout rewards accordingly.

If this is right

  • Across five datasets and three base models, MEDS raises pass@1 by up to 4.13 points and pass@128 by up to 4.37 points over existing methods (the pass@k estimator is sketched after this list).
  • Behavioral diversity rises during sampling, confirmed by both LLM annotations and quantitative metrics.
  • Rollouts matching common error clusters receive heavier penalties, which the method claims directly reduces looping on the same failures.
  • The approach targets a failure mode that standard entropy regularization does not address explicitly.
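
For reference on the headline metrics in the first bullet, the standard unbiased pass@k estimator is shown below; whether the paper computes pass@1 and pass@128 exactly this way is an assumption.

```python
# Standard unbiased pass@k estimator; assumed, not confirmed, to match the paper's protocol.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn per problem, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 256 samples on one problem, 12 of them correct.
print(pass_at_k(256, 12, 1))    # ~0.047
print(pass_at_k(256, 12, 128))  # ~1.0: 12 correct samples almost surely land in a 128-draw budget
```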

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-and-cluster idea could be tested in other sequential decision settings where policies repeat suboptimal actions.
  • If the stored representations capture task-relevant features, the method might reduce reliance on hand-crafted reward terms in future RL setups.
  • Extending the memory window or trying different clustering thresholds could be checked to see whether longer history improves or harms results.

Load-bearing premise

Density-based clustering on stored intermediate representations will correctly group and flag detrimental recurrent error patterns so that extra penalties on them produce useful exploration instead of suppressing valid answer variations.
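
One hedged way to probe this premise is a per-cluster success-rate audit: cluster the stored features, then check whether the dense clusters the penalty targets are actually dominated by failed rollouts. The function and parameter names below are illustrative.

```python
# Cluster-purity diagnostic: are dense clusters mostly failures? (illustrative sketch)
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_failure_rates(features, is_correct, eps=0.5, min_samples=5):
    """Return {cluster_id: (size, failure_rate)} for each dense cluster."""
    features = np.asarray(features)
    is_correct = np.asarray(is_correct, dtype=bool)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    report = {}
    for c in set(labels) - {-1}:  # skip DBSCAN noise points
        mask = labels == c
        report[int(c)] = (int(mask.sum()), float(1.0 - is_correct[mask].mean()))
    return report
```

Large clusters with failure rates near 1.0 would support the premise; clusters mixing many correct rollouts would mean the penalty also suppresses valid answer variations.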

What would settle it

Running the same training loops without the clustering step or with randomly assigned penalties and finding no drop in diversity metrics or performance would show that the targeted identification of error patterns is not what drives the gains.
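
A minimal sketch of that control, with illustrative names: keep the rest of the training loop fixed but shuffle the cluster-derived penalties across rollouts so they no longer track any identified pattern.

```python
# Random-penalty control: same penalty budget, no link to error clusters (illustrative).
import numpy as np

def random_penalty_control(base_rewards, cluster_penalties, rng=None):
    """Apply the same set of penalties, but assigned to rollouts at random."""
    if rng is None:
        rng = np.random.default_rng(0)
    shuffled = rng.permutation(np.asarray(cluster_penalties, dtype=float))
    return np.asarray(base_rewards, dtype=float) - shuffled
```

If training with this control matched MEDS on diversity and pass@k, the targeted identification of error patterns would not be what drives the gains.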

read the original abstract

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MEDS, a Memory-Enhanced Dynamic reward Shaping method for RL in LLMs. It stores intermediate representations from past rollouts, applies density-based clustering to detect frequently recurring patterns (interpreted as errors), and applies heavier penalties to rollouts in denser clusters. This is intended to reduce repetitive mistakes and increase exploration beyond standard entropy regularization. Experiments across five datasets and three base models report consistent gains (up to 4.13 pass@1 and 4.37 pass@128) plus improved diversity metrics from LLM annotations and quantitative measures.

Significance. If the core mechanism reliably penalizes detrimental recurrent errors rather than common valid behaviors, MEDS would offer a practical extension to reward shaping that directly targets historical failure modes in LLM sampling. The multi-dataset, multi-model evaluation and dual diversity analyses (qualitative and quantitative) provide a reasonable basis for claiming broader applicability, though verification of the error-identification assumption is required for the result to be load-bearing.

major comments (2)
  1. [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.
  2. [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.
minor comments (2)
  1. [Experiments] The abstract and results mention 'LLM-based annotations' for diversity but provide no details on the annotation prompt, model used, or inter-annotator agreement; this should be clarified for reproducibility.
  2. [Method] Notation for the penalty scaling factor and clustering hyperparameters is introduced without explicit equations or pseudocode; adding a short algorithm box would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, acknowledging where the manuscript can be strengthened through revisions and providing clarifications on the methodological assumptions and experimental reporting.

read point-by-point responses
  1. Referee: [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.

    Authors: We agree that the method relies on the assumption that dense clusters in the stored representations primarily capture recurrent error patterns rather than common correct behaviors. This interpretation is motivated by the nature of the tasks (e.g., code generation), where repeated failures often manifest as similar intermediate representations, while successful solutions tend to be more diverse. However, we acknowledge that this assumption was not explicitly verified in the original submission. In the revision, we will add a new analysis subsection that evaluates cluster purity by computing the average success rate (using ground-truth labels) for rollouts assigned to each cluster. We will also report the proportion of successful rollouts excluded or down-weighted and discuss cases where clusters contain mixed outcomes. This will provide empirical grounding for the error-identification claim and allow readers to assess the assumption directly. revision: yes

  2. Referee: [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.

    Authors: We accept that the experimental section requires additional rigor for reproducibility and to isolate the contribution of MEDS. In the revised version, we will include: (1) statistical significance tests (paired t-tests across 5 random seeds) for all reported pass@k improvements; (2) complete hyperparameter tables for baselines and MEDS, including learning rates, entropy coefficients, memory buffer sizes, and clustering parameters (eps and min_samples for DBSCAN); (3) implementation details for the three base models, specifying exact model checkpoints, LoRA configurations, and training hardware; and (4) a new subsection with compute/memory measurements showing that the overhead of representation storage and clustering is under 5% of total training time, plus an ablation that disables the density-based penalty while retaining the memory buffer to control for incidental regularization. These additions will strengthen attribution of the gains to the dynamic shaping mechanism. revision: yes
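
As a concrete picture of the significance test proposed in the second response, a paired t-test over per-seed scores could look like the sketch below; the numbers are hypothetical placeholders, not results from the paper.

```python
# Paired t-test across random seeds (placeholder numbers, not the paper's results).
from scipy.stats import ttest_rel

meds_pass1     = [41.2, 40.8, 42.0, 41.5, 41.1]   # hypothetical per-seed pass@1 for MEDS
baseline_pass1 = [37.9, 38.4, 37.5, 38.1, 37.7]   # hypothetical per-seed pass@1 for the baseline

t_stat, p_value = ttest_rel(meds_pass1, baseline_pass1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")      # p < 0.05 would support the claimed gain
```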

Circularity Check

0 steps flagged

No circularity; empirical framework with independent clustering step

full rationale

The paper describes MEDS as storing intermediate representations from rollouts, applying density-based clustering to identify recurring patterns, and penalizing denser clusters to encourage exploration. No equations, derivations, or self-citations are shown that reduce the claimed performance gains (e.g., pass@1 improvements) to a quantity defined in terms of itself or fitted directly to the target metric. The central mechanism relies on an external clustering procedure applied to stored features rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain. Experimental results across datasets and models provide independent validation, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on unstated assumptions about representation quality and clustering validity without independent evidence in the provided abstract.

free parameters (2)
  • clustering hyperparameters
    Density-based clustering requires parameters such as neighborhood radius and minimum points per cluster that must be chosen or tuned to define error groups.
  • penalty scaling factor
    The strength with which prevalent clusters are penalized is not specified and likely requires selection to balance exploration and performance.
axioms (2)
  • domain assumption Intermediate model representations encode distinguishable features of behavioral error patterns across rollouts.
    Invoked implicitly when storing representations to enable clustering of recurring mistakes.
  • domain assumption Penalizing rollouts in high-density error clusters promotes broader exploration without harming overall learning.
    Central to the reward shaping logic but not justified in the abstract.

pith-pipeline@v0.9.0 · 5482 in / 1304 out tokens · 32457 ms · 2026-05-10T15:30:06.209126+00:00 · methodology

discussion (0)

