pith. machine review for the scientific record.

arxiv: 2604.11297 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · large language models · reward shaping · memory · sampling diversity · error patterns · dynamic rewards

The pith

Storing past rollout features and clustering recurring errors lets dynamic penalties raise diversity and accuracy in language-model reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MEDS to fix a common problem in reinforcement learning for large language models: policies that keep producing the same mistakes across many attempts. It stores intermediate representations from earlier rollouts, runs density-based clustering on those features to find which error patterns appear most often, and then applies stronger penalties to rollouts that match the popular error clusters. This memory-based adjustment supplements ordinary entropy terms by explicitly discouraging repeated failures rather than just adding generic randomness. A reader would care because higher diversity in sampling can translate into better final performance on tasks that require exploring many possible answers. The reported results show consistent gains across multiple datasets and base models when this shaping is used.
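
To make the mechanism concrete, here is a minimal sketch of the store-cluster-penalize loop, assuming DBSCAN as the density-based clusterer and a penalty proportional to cluster prevalence; the names, the nearest-centroid assignment, and the exact penalty form are illustrative guesses rather than the paper's implementation.

```python
# Illustrative sketch of memory-based reward shaping (not the authors' code).
import numpy as np
from sklearn.cluster import DBSCAN

class RolloutMemory:
    """Stores pooled intermediate representations from past rollouts."""
    def __init__(self, max_size=10_000):
        self.max_size = max_size
        self.features = []

    def add(self, feats):
        self.features.extend(np.asarray(feats))
        self.features = self.features[-self.max_size:]

    def as_array(self):
        return np.asarray(self.features)

def shaped_rewards(base_rewards, rollout_feats, memory,
                   eps=0.5, min_samples=5, penalty_scale=0.1):
    """Subtract a prevalence-weighted penalty from rollouts whose features
    fall near a dense cluster of the stored history."""
    history = memory.as_array()
    shaped = np.asarray(base_rewards, dtype=float).copy()
    if len(history) < min_samples:
        return shaped
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(history)
    clusters = [c for c in set(labels) if c != -1]  # -1 marks noise points
    if not clusters:
        return shaped
    centroids = {c: history[labels == c].mean(axis=0) for c in clusters}
    sizes = {c: int(np.sum(labels == c)) for c in clusters}
    for i, f in enumerate(np.asarray(rollout_feats)):
        dists = {c: np.linalg.norm(f - mu) for c, mu in centroids.items()}
        nearest = min(dists, key=dists.get)
        if dists[nearest] < eps:  # rollout matches a recurring pattern
            shaped[i] -= penalty_scale * sizes[nearest] / len(history)
    return shaped
```

In this reading, the penalty grows with how often a pattern has recurred, which is the sense in which the shaping is dynamic: the same behavior is punished more heavily once its cluster has become prevalent.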

Core claim

By storing intermediate model representations from previous rollouts and applying density-based clustering to detect frequently recurring error patterns, MEDS dynamically shapes rewards to penalize prevalent mistakes more heavily. This encourages broader exploration, reduces repeated erroneous behaviors, and yields higher average performance than standard baselines.

What carries the argument

MEDS (Memory-Enhanced Dynamic reward Shaping), which stores historical intermediate representations, clusters them to identify recurrent error patterns, and adjusts per-rollout rewards accordingly.

If this is right

  • Across five datasets and three base models, MEDS raises pass@1 by up to 4.13 points and pass@128 by up to 4.37 points over existing methods (the pass@k estimator is sketched after this list).
  • Behavioral diversity rises during sampling, confirmed by both LLM annotations and quantitative metrics.
  • Rollouts matching common error clusters receive heavier penalties, which the method claims directly reduces looping on the same failures.
  • The approach targets a failure mode that standard entropy regularization does not address explicitly.
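
For reference on the headline metrics in the first bullet, the standard unbiased pass@k estimator is shown below; whether the paper computes pass@1 and pass@128 exactly this way is an assumption.

```python
# Standard unbiased pass@k estimator; assumed, not confirmed, to match the paper's protocol.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn per problem, c = correct samples, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 256 samples on one problem, 12 of them correct.
print(pass_at_k(256, 12, 1))    # ~0.047
print(pass_at_k(256, 12, 128))  # ~1.0: 12 correct samples almost surely land in a 128-draw budget
```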

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-and-cluster idea could be tested in other sequential decision settings where policies repeat suboptimal actions.
  • If the stored representations capture task-relevant features, the method might reduce reliance on hand-crafted reward terms in future RL setups.
  • Extending the memory window or trying different clustering thresholds could be checked to see whether longer history improves or harms results.

Load-bearing premise

Density-based clustering on stored intermediate representations will correctly group and flag detrimental recurrent error patterns so that extra penalties on them produce useful exploration instead of suppressing valid answer variations.
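
One hedged way to probe this premise is a per-cluster success-rate audit: cluster the stored features, then check whether the dense clusters the penalty targets are actually dominated by failed rollouts. The function and parameter names below are illustrative.

```python
# Cluster-purity diagnostic: are dense clusters mostly failures? (illustrative sketch)
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_failure_rates(features, is_correct, eps=0.5, min_samples=5):
    """Return {cluster_id: (size, failure_rate)} for each dense cluster."""
    features = np.asarray(features)
    is_correct = np.asarray(is_correct, dtype=bool)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    report = {}
    for c in set(labels) - {-1}:  # skip DBSCAN noise points
        mask = labels == c
        report[int(c)] = (int(mask.sum()), float(1.0 - is_correct[mask].mean()))
    return report
```

Large clusters with failure rates near 1.0 would support the premise; clusters mixing many correct rollouts would mean the penalty also suppresses valid answer variations.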

What would settle it

Running the same training loops without the clustering step or with randomly assigned penalties and finding no drop in diversity metrics or performance would show that the targeted identification of error patterns is not what drives the gains.
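
A minimal sketch of that control, with illustrative names: keep the rest of the training loop fixed but shuffle the cluster-derived penalties across rollouts so they no longer track any identified pattern.

```python
# Random-penalty control: same penalty budget, no link to error clusters (illustrative).
import numpy as np

def random_penalty_control(base_rewards, cluster_penalties, rng=None):
    """Apply the same set of penalties, but assigned to rollouts at random."""
    if rng is None:
        rng = np.random.default_rng(0)
    shuffled = rng.permutation(np.asarray(cluster_penalties, dtype=float))
    return np.asarray(base_rewards, dtype=float) - shuffled
```

If training with this control matched MEDS on diversity and pass@k, the targeted identification of error patterns would not be what drives the gains.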

read the original abstract

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MEDS, a Memory-Enhanced Dynamic reward Shaping method for RL in LLMs. It stores intermediate representations from past rollouts, applies density-based clustering to detect frequently recurring patterns (interpreted as errors), and applies heavier penalties to rollouts in denser clusters. This is intended to reduce repetitive mistakes and increase exploration beyond standard entropy regularization. Experiments across five datasets and three base models report consistent gains (up to 4.13 pass@1 and 4.37 pass@128) plus improved diversity metrics from LLM annotations and quantitative measures.

Significance. If the core mechanism reliably penalizes detrimental recurrent errors rather than common valid behaviors, MEDS would offer a practical extension to reward shaping that directly targets historical failure modes in LLM sampling. The multi-dataset, multi-model evaluation and dual diversity analyses (qualitative and quantitative) provide a reasonable basis for claiming broader applicability, though verification of the error-identification assumption is required for the result to be load-bearing.

major comments (2)
  1. [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.
  2. [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.
minor comments (2)
  1. [Experiments] The abstract and results mention 'LLM-based annotations' for diversity but provide no details on the annotation prompt, model used, or inter-annotator agreement; this should be clarified for reproducibility.
  2. [Method] Notation for the penalty scaling factor and clustering hyperparameters is introduced without explicit equations or pseudocode; adding a short algorithm box would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, acknowledging where the manuscript can be strengthened through revisions and providing clarifications on the methodological assumptions and experimental reporting.

read point-by-point responses
  1. Referee: [Method] Method description (around the clustering and penalty step): density-based clustering is performed on stored intermediate representations without any described filtering step that distinguishes error patterns from frequently occurring correct solutions (e.g., no per-cluster success rate check against ground truth or exclusion of successful rollouts before clustering). This makes the central claim that penalties reduce repeated mistakes rather than suppress useful variations dependent on an unverified assumption.

    Authors: We agree that the method relies on the assumption that dense clusters in the stored representations primarily capture recurrent error patterns rather than common correct behaviors. This interpretation is motivated by the nature of the tasks (e.g., code generation), where repeated failures often manifest as similar intermediate representations, while successful solutions tend to be more diverse. However, we acknowledge that this assumption was not explicitly verified in the original submission. In the revision, we will add a new analysis subsection that evaluates cluster purity by computing the average success rate (using ground-truth labels) for rollouts assigned to each cluster. We will also report the proportion of successful rollouts excluded or down-weighted and discuss cases where clusters contain mixed outcomes. This will provide empirical grounding for the error-identification claim and allow readers to assess the assumption directly. revision: yes

  2. Referee: [Experiments] Experimental results section: the reported average improvements lack accompanying statistical significance tests, exact baseline hyperparameter settings, implementation details for the three base models, and controls for potential confounds such as extra compute or memory overhead from storing and clustering representations. Without these, it is difficult to attribute the 4.13/4.37 point gains specifically to the dynamic shaping rather than incidental regularization effects.

    Authors: We accept that the experimental section requires additional rigor for reproducibility and to isolate the contribution of MEDS. In the revised version, we will include: (1) statistical significance tests (paired t-tests across 5 random seeds) for all reported pass@k improvements; (2) complete hyperparameter tables for baselines and MEDS, including learning rates, entropy coefficients, memory buffer sizes, and clustering parameters (eps and min_samples for DBSCAN); (3) implementation details for the three base models, specifying exact model checkpoints, LoRA configurations, and training hardware; and (4) a new subsection with compute/memory measurements showing that the overhead of representation storage and clustering is under 5% of total training time, plus an ablation that disables the density-based penalty while retaining the memory buffer to control for incidental regularization. These additions will strengthen attribution of the gains to the dynamic shaping mechanism. revision: yes
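
As a concrete picture of the significance test proposed in the second response, a paired t-test over per-seed scores could look like the sketch below; the numbers are hypothetical placeholders, not results from the paper.

```python
# Paired t-test across random seeds (placeholder numbers, not the paper's results).
from scipy.stats import ttest_rel

meds_pass1     = [41.2, 40.8, 42.0, 41.5, 41.1]   # hypothetical per-seed pass@1 for MEDS
baseline_pass1 = [37.9, 38.4, 37.5, 38.1, 37.7]   # hypothetical per-seed pass@1 for the baseline

t_stat, p_value = ttest_rel(meds_pass1, baseline_pass1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")      # p < 0.05 would support the claimed gain
```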

Circularity Check

0 steps flagged

No circularity; empirical framework with independent clustering step

full rationale

The paper describes MEDS as storing intermediate representations from rollouts, applying density-based clustering to identify recurring patterns, and penalizing denser clusters to encourage exploration. No equations, derivations, or self-citations are shown that reduce the claimed performance gains (e.g., pass@1 improvements) to a quantity defined in terms of itself or fitted directly to the target metric. The central mechanism relies on an external clustering procedure applied to stored features rather than any self-referential definition, fitted-input prediction, or load-bearing self-citation chain. Experimental results across datasets and models provide independent validation, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on unstated assumptions about representation quality and clustering validity without independent evidence in the provided abstract.

free parameters (2)
  • clustering hyperparameters
    Density-based clustering requires parameters such as neighborhood radius and minimum points per cluster that must be chosen or tuned to define error groups.
  • penalty scaling factor
    The strength with which prevalent clusters are penalized is not specified and likely requires selection to balance exploration and performance.
axioms (2)
  • domain assumption Intermediate model representations encode distinguishable features of behavioral error patterns across rollouts.
    Invoked implicitly when storing representations to enable clustering of recurring mistakes.
  • domain assumption Penalizing rollouts in high-density error clusters promotes broader exploration without harming overall learning.
    Central to the reward shaping logic but not justified in the abstract.

pith-pipeline@v0.9.0 · 5482 in / 1304 out tokens · 32457 ms · 2026-05-10T15:30:06.209126+00:00 · methodology

discussion (0)

