Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Chengchun Shi; Hongyi Zhou; Jin Zhu; Kai Ye; Shijin Gong; Xinyu Zhang

arxiv: 2604.28005 · v2 · pith:737LKEDPnew · submitted 2026-04-30 · 💻 cs.LG · stat.ML

Shijin Gong, Kai Ye, Jin Zhu, Xinyu Zhang, Hongyi Zhou, Chengchun Shi

Pith reviewed 2026-05-20 23:44 UTC · model grok-4.3

keywords: kernel smoothing · advantage estimation · LLM reasoning · reinforcement learning · value function estimation · policy optimization · nonparametric statistics

The pith

Kernel smoothing delivers accurate value estimates for LLM policy optimization using only a few reasoning traces per prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts kernel smoothing from nonparametric statistics to estimate value functions in reinforcement learning for large language models. In resource-limited settings where only a small number of reasoning traces can be sampled per prompt, existing approaches either train costly value networks or rely on high-variance single-sample estimates. Kernel smoothing weights nearby traces to produce low-bias value and gradient estimates without these drawbacks. Theoretical and numerical results indicate this yields improved policy optimization for reasoning tasks.

Core claim

Kernel smoothing applied to reasoning traces produces accurate estimates of the value function and its gradients, enabling more effective policy updates in LLM reasoning even when sample sizes per prompt remain small.

What carries the argument

Kernel smoothing: a nonparametric estimator that takes the value of a reasoning trace to be a kernel-weighted average of the outcomes of similar traces.
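As a rough illustration of kernel-weighted averaging, the sketch below implements a generic Nadaraya-Watson estimator in Python. It is not the paper's estimator: the Gaussian kernel, the scalar coordinates standing in for trace similarity, the function names, and the toy rewards are all assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    return np.exp(-0.5 * u ** 2)


def kernel_value_estimate(query, points, rewards, bandwidth):
    """Nadaraya-Watson estimate: kernel-weighted average of observed rewards."""
    points = np.asarray(points, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    weights = gaussian_kernel((points - query) / bandwidth)
    total = weights.sum()
    if total == 0.0:
        # No observation receives weight: fall back to the plain sample mean.
        return float(rewards.mean())
    return float(np.dot(weights, rewards) / total)


# Hypothetical 1-D similarity coordinates and binary rewards (assumed data).
points = [0.0, 0.1, 0.2, 0.9, 1.0]
rewards = [1.0, 1.0, 0.0, 0.0, 0.0]

near_success = kernel_value_estimate(0.05, points, rewards, bandwidth=0.1)
near_failure = kernel_value_estimate(0.95, points, rewards, bandwidth=0.1)
very_smooth = kernel_value_estimate(0.05, points, rewards, bandwidth=1e6)
```

A narrow bandwidth tracks the outcomes of nearby observations, while a very wide one collapses towards the plain sample mean of the rewards.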

If this is right

Lower-variance policy gradients become available without training a separate value network.
Sample efficiency improves relative to single-trajectory methods while keeping per-prompt cost low.
Policy optimization reaches higher-quality reasoning behaviors under fixed computational budgets.
Theoretical error bounds on value estimation translate directly into convergence rates for the learned policy.
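The first consequence above depends on a value estimate acting as a baseline. A minimal sketch of that effect follows, assuming a toy three-action softmax policy with deterministic rewards (neither is taken from the paper). It computes the exact mean and total variance of the score-function gradient estimator by enumeration, with and without a baseline equal to the expected reward.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()


def gradient_moments(logits, rewards, baseline):
    """Exact mean and total variance of (r(a) - baseline) * grad log pi(a)."""
    probs = softmax(np.asarray(logits, dtype=float))
    rewards = np.asarray(rewards, dtype=float)
    k = len(probs)
    # Row a holds grad_theta log pi(a) = e_a - probs for a softmax policy.
    scores = np.eye(k) - probs
    grads = (rewards - baseline)[:, None] * scores
    mean = probs @ grads
    second_moment = probs @ np.sum(grads ** 2, axis=1)
    total_variance = second_moment - np.sum(mean ** 2)
    return mean, float(total_variance)


# Assumed toy problem: uniform policy over three actions, one rewarded action.
logits = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.0])
value = float(softmax(logits) @ rewards)  # expected reward under the policy

mean_plain, var_plain = gradient_moments(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_moments(logits, rewards, baseline=value)
```

In this toy case both estimators share the same mean, and the baselined one has the smaller total variance. That is the property a more accurate value estimate is meant to exploit.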

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same nonparametric smoothing idea could apply to other structured decision sequences such as code generation or mathematical proofs.
Hybrid methods that combine kernel estimates with occasional neural value networks might further reduce variance in very long traces.
If reasoning traces lie on a low-dimensional manifold, even simpler kernel choices could suffice and lower computational cost.

Load-bearing premise

The value function over reasoning traces admits a sufficiently smooth representation in a kernel-induced space so that smoothing with small per-prompt sample sizes yields low-bias estimates.

What would settle it

An experiment in which kernel-smoothed estimates with few samples per prompt produce higher bias or variance than simple averaging, or fail to improve final policy performance over baselines.

Figures

Figures reproduced from arXiv: 2604.28005 by Chengchun Shi, Hongyi Zhou, Jin Zhu, Kai Ye, Shijin Gong, Xinyu Zhang.

**Figure 1.** Expected rewards of one-shot GRPO (Wang et al., 2025b), the oracle algorithm, and our method (denoted as KAE) on training (left) and testing (right) datasets in the one-shot regime where the training data consists of a single observation. One-shot GRPO applies the standard GRPO algorithm directly to this regime. Shaded areas represent confidence intervals.

**Figure 2.** Illustrations of a generic algorithm that unifies A2C, REINFORCE- and GRPO-type algorithms. 1. The first approach is A2C, which introduces a critic function C(X) to serve as a baseline and replaces the reward Z with an advantage function A = Z − C(X) in constructing the policy gradient estimator ĝ(θ). Its main idea is that ∇θ log πθ(Y | X) is a score function, and thus multiplying it by any C(X) yields a …

**Figure 3.** MSE of KAE’s value estimator on the MATH dataset across three training steps under varying kernel bandwidths. The left and right panels visualize the MSEs under the triangular and exponential kernels, respectively. Horizontal lines denote the MSEs of REINFORCE++ and GRPO, which are independent of bandwidth and kernel function.

**Figure 4.** Test accuracy of models post-trained with standard REINFORCE (blue), KAE (red), and a REINFORCE variant using the proposed prompt sampling scheme, on GSM8K (left) and MATH (right) across different training steps. Shaded areas represent the standard error of the accuracy curves, aggregated over five training replications.
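Figure 3 compares triangular and exponential kernels across bandwidths. The sketch below assumes the textbook definitions K(u) = max(1 − |u|, 0) and K(u) = exp(−|u|) and shows how the bandwidth reshapes the normalized weights. The paper's exact kernel parameterization may differ, and the distances and helper names are hypothetical.

```python
import numpy as np


def triangular_kernel(u):
    """Compact support: observations farther than one bandwidth get zero weight."""
    return np.maximum(1.0 - np.abs(u), 0.0)


def exponential_kernel(u):
    """Unbounded support: every observation keeps some positive weight."""
    return np.exp(-np.abs(u))


def normalized_weights(distances, bandwidth, kernel):
    """Kernel weights rescaled to sum to one (assumes at least one is positive)."""
    w = kernel(np.asarray(distances, dtype=float) / bandwidth)
    return w / w.sum()


# Hypothetical distances between a query point and four observations.
distances = np.array([0.0, 1.0, 2.0, 4.0])

tri_narrow = normalized_weights(distances, 2.0, triangular_kernel)
tri_wide = normalized_weights(distances, 8.0, triangular_kernel)
exp_narrow = normalized_weights(distances, 2.0, exponential_kernel)
```

With the narrow bandwidth the triangular kernel discards the two farthest observations entirely, whereas the exponential kernel only down-weights them. Widening the bandwidth brings the far observations back in and moves the estimate towards a plain average.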

Abstract

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.
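The variance-reduction role of the value function mentioned in the abstract rests on a standard identity: subtracting a baseline that depends only on the prompt leaves the policy gradient unbiased. The derivation below is a textbook sketch, not a result quoted from the paper. It writes X for the prompt, Y for the response, Z for the reward and C(X) for the baseline, and assumes a finite response space so that the gradient and the sum can be exchanged.

```latex
\begin{align*}
\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}
  \left[ C(X)\, \nabla_\theta \log \pi_\theta(Y \mid X) \right]
  &= C(X) \sum_{y} \pi_\theta(y \mid X)\,
     \frac{\nabla_\theta \pi_\theta(y \mid X)}{\pi_\theta(y \mid X)} \\
  &= C(X)\, \nabla_\theta \sum_{y} \pi_\theta(y \mid X)
   = C(X)\, \nabla_\theta 1 = 0 .
\end{align*}
```

It follows that the estimator built from the advantage Z − C(X) has the same expectation as the one built from Z alone, so the choice of C(X) affects only the variance.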

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Kernelized Advantage Estimation, which applies kernel smoothing from nonparametric statistics to estimate value functions for policy gradients in LLM reasoning. In resource-constrained regimes with only a few samples per prompt, the method seeks low-variance advantage estimates without training a separate value network or drawing large Monte Carlo batches, claiming improved policy optimization backed by numerical experiments and theoretical analysis.

Significance. If the kernel estimator delivers the claimed low-bias value and gradient estimates under small per-prompt sample sizes, the work would usefully import classical nonparametric tools into LLM RL, offering a lightweight alternative to value networks while addressing practical sampling costs. The explicit use of kernel methods for advantage estimation is a clear strength.

major comments (2)

[§3.1] Assumption 1 and the subsequent convergence theorem: the central claim that kernel smoothing yields low-bias estimates from small per-prompt samples rests on the value function over reasoning traces lying in a sufficiently smooth RKHS. No domain-specific verification or counterexample analysis is supplied for the discrete, combinatorial space of token sequences typical in LLM reasoning; without this, the invoked nonparametric rates may not apply and bias could offset the reported variance reduction.
[§5.2] Bandwidth selection procedure: the kernel bandwidth is treated as a free parameter whose choice is not automated or cross-validated within the reported experiments. Post-hoc tuning could inflate the numerical gains in policy optimization, weakening the robustness of the empirical support for the method.

minor comments (2)

[Abstract] The abstract states that 'numerical and theoretical results demonstrate' the claims but does not name the specific theorem or kernel family; adding one sentence would improve readability.
[Method] Notation for the kernel applied to variable-length reasoning traces (token sequences vs. embeddings) should be made explicit in the method section to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.

Referee: [§3.1] Assumption 1 and the subsequent convergence theorem: the central claim that kernel smoothing yields low-bias estimates from small per-prompt samples rests on the value function over reasoning traces lying in a sufficiently smooth RKHS. No domain-specific verification or counterexample analysis is supplied for the discrete, combinatorial space of token sequences typical in LLM reasoning; without this, the invoked nonparametric rates may not apply and bias could offset the reported variance reduction.

Authors: We thank the referee for this observation. Our theoretical analysis is developed under the standard RKHS smoothness assumption (Assumption 1) from nonparametric statistics, which enables the stated convergence rates. We acknowledge that the discrete, combinatorial structure of token sequences may not automatically satisfy this without further justification. In the revised manuscript we will add a dedicated paragraph discussing kernel construction via embedding-based similarities (e.g., cosine similarity on sentence embeddings or edit-distance kernels) that empirically induce the required regularity, together with additional diagnostic plots from our LLM experiments showing that bias remains small relative to the observed variance reduction. We will also note the assumption's scope explicitly.

Revision: partial

Referee: [§5.2] Bandwidth selection procedure: the kernel bandwidth is treated as a free parameter whose choice is not automated or cross-validated within the reported experiments. Post-hoc tuning could inflate the numerical gains in policy optimization, weakening the robustness of the empirical support for the method.

Authors: We agree that post-hoc bandwidth selection limits the strength of the empirical claims. In the revised version we will replace the current procedure with an automated, data-driven method (leave-one-out cross-validation on the small per-prompt sample set, or a scaled Silverman's rule adapted to the kernel on embeddings). We will re-run the main experiments with this procedure and report the resulting policy optimization metrics to confirm that the reported gains are retained under automated selection.

Revision: yes
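The leave-one-out cross-validation mentioned in the simulated rebuttal can be sketched generically as follows. This is an illustration, not the authors' procedure: the Gaussian kernel, the one-dimensional coordinates, the candidate grid, and the function names `loo_cv_error` and `select_bandwidth` are assumptions, and the sketch needs at least two observations.

```python
import numpy as np


def loo_cv_error(points, rewards, bandwidth):
    """Mean squared leave-one-out error of a Gaussian-kernel smoother."""
    points = np.asarray(points, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    diffs = (points[:, None] - points[None, :]) / bandwidth
    weights = np.exp(-0.5 * diffs ** 2)
    np.fill_diagonal(weights, 0.0)  # leave each observation out of its own fit
    totals = weights.sum(axis=1)
    # Fall back to the leave-one-out mean when no neighbour gets any weight.
    loo_mean = (rewards.sum() - rewards) / (len(rewards) - 1)
    safe_totals = np.where(totals > 0.0, totals, 1.0)
    preds = np.where(totals > 0.0, weights @ rewards / safe_totals, loo_mean)
    return float(np.mean((rewards - preds) ** 2))


def select_bandwidth(points, rewards, candidates):
    """Return the candidate with the smallest leave-one-out error, plus all errors."""
    errors = [loo_cv_error(points, rewards, h) for h in candidates]
    return candidates[int(np.argmin(errors))], errors


# Assumed toy data: rewards switch from 0 to 1 halfway along the coordinate.
points = np.linspace(0.0, 1.0, 20)
rewards = (points > 0.5).astype(float)
candidates = [0.02, 0.1, 1.0, 10.0]

best, errors = select_bandwidth(points, rewards, candidates)
```

Because the selection uses only held-out predictions on the observed sample, it avoids tuning the bandwidth against downstream policy performance, which is the concern the referee raises.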

Circularity Check

0 steps flagged

No significant circularity; standard nonparametric estimator imported externally

The paper applies classical kernel smoothing from nonparametric statistics as an external technique for value function estimation over reasoning traces, without reducing its key results to parameters fitted inside the same work or to self-citations. The derivation chain imports a pre-existing statistical method and demonstrates its use for policy optimization, with claims of numerical and theoretical support resting on the standard convergence properties of kernel estimators rather than any tautological redefinition or internal fit. This keeps the central argument self-contained against external benchmarks in statistics literature.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the transferability of kernel smoothing to the discrete space of LLM reasoning traces and on the existence of a suitable kernel and bandwidth that work with very few samples per prompt.

free parameters (1)

kernel bandwidth
Bandwidth controls the degree of smoothing and is typically selected or tuned; its value is not reported in the abstract.

axioms (1)

Domain assumption: value functions over reasoning traces are sufficiently regular for kernel smoothing to produce accurate estimates at small sample sizes.
This premise is required for the nonparametric estimator to outperform both value-network and pure Monte-Carlo baselines in the stated regime.

pith-pipeline@v0.9.0 · 5758 in / 1221 out tokens · 58211 ms · 2026-05-20T23:44:40.873348+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We employ kernel smoothing as a concrete example for value function estimation... $\widehat{V}_i(x) = \frac{1}{ih} \sum_{j} K\!\left(\frac{i-j}{ih}\right) Z_j$ ... Assumption 4 (Smoothness): $V^{\pi_\theta}(x)$ is $p$-times continuously differentiable

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

KAE achieves Stone’s optimal convergence rate... $\mathrm{MSE}(\widehat{V}) = O\!\left([N_i(x)]^{-2p/(2p+1)}\right)$
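The exponent in the quoted rate matches the usual bias-variance trade-off for kernel smoothing. The heuristic below is a textbook sketch, not the paper's proof. It assumes a $p$-times differentiable target and a kernel of matching order, writes $N$ as shorthand for $N_i(x)$ and $h$ for the bandwidth, and uses the standard orders $O(h^{p})$ for the bias and $O(1/(Nh))$ for the variance.

```latex
\begin{align*}
\mathrm{MSE}(\widehat{V})
  &= \mathrm{Bias}^2 + \mathrm{Var}
   = O\!\left(h^{2p}\right) + O\!\left(\frac{1}{N h}\right), \\
h^{\ast} &\asymp N^{-1/(2p+1)}
  \quad\Longrightarrow\quad
  \mathrm{MSE}(\widehat{V}) = O\!\left(N^{-2p/(2p+1)}\right).
\end{align*}
```

Balancing the two terms gives $h^{2p} \asymp 1/(Nh)$, and both then scale as $N^{-2p/(2p+1)}$. A bandwidth that is too small leaves the variance term dominant, and one that is too large leaves the bias term dominant.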

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Perturbations to Extrapolate Your LLM
stat.ML 2026-05 unverdicted novelty 6.0

A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.

Perturbation is All You Need for Extrapolating Language Models
stat.ML 2026-05 unverdicted novelty 6.0

Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 2 Pith papers · 12 internal anchors

[1] Jinhang Chai, Elynn Chen, and Jianqing Fan. Deep transfer Q-learning for offline non-stationary reinforcement learning. arXiv preprint arXiv:2501.04870.
[2] Young Hyun Cho and Will Wei Sun. Privacy-preserving reinforcement learning from human feedback via decoupled reward modeling. arXiv preprint arXiv:2603.22563.
[3] Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. GPG: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546.
[4] Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. CDE: Curiosity-driven exploration for efficient reinforcement learning in large language models. arXiv preprint arXiv:2509.09675.
[5] Asim H Gazi, Yongyi Guo, Daiqi Gao, Ziping Xu, Kelly W Zhang, and Susan A Murphy. Statistical reinforcement learning in the real world: A survey of challenges and future directions. arXiv preprint arXiv:2601.15353.
[6] Lin Ge, Hengrui Cai, Runzhe Wan, Yang Xu, and Rui Song. A review of causal decision making. arXiv preprint arXiv:2502.16156.
[7] Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. EBPO: Empirical Bayes shrinkage for stabilizing group-relative policy optimization. arXiv preprint arXiv:2602.05165.
[8] Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy RL with optimal reward baseline. arXiv preprint arXiv:2505.23585.
[9] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262.
[10] Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, and Yuxin Chen. On the learning dynamics of RLVR at the edge of competence. arXiv preprint arXiv:2602.14872.
[11] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
[12] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In ICLR 2019 Workshop on Deep Reinforcement Learning Meets Structured Prediction.
[13] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
[14] Seong Jin Lee, Will Wei Sun, and Yufeng Liu. Low-rank contextual reinforcement learning from heterogeneous human feedback. arXiv preprint arXiv:2412.19436.
[15] Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, and Chaochao Lu. RePO: Replay-enhanced policy optimization. arXiv preprint arXiv:2506.09340, 2025a.
Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for GRPO. arXiv preprint arXiv:2603.13134, 2026b.
Yuhan Li, Eugene Han, Yifan Hu, Zhenglin...
[16] Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342.
[17] Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J Su, and Jiancong Xiao. Statistical impossibility and possibility of aligning LLMs with human preferences: From Condorcet paradox to Nash equilibrium. arXiv preprint arXiv:2503.10990, 2025a.
Pangpang Liu, Junwei Lu, and Will Wei Sun. Uncertainty quantification for large language model reward learning under heterog...
[18] Weidong Liu, Jiyuan Tu, Xi Chen, and Yichen Zhang. Online estimation and inference for robust policy evaluation in reinforcement learning. The Annals of Statistics, 53(5):2128–2152, 2025c.
Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, et al. Fin-R1: A large language model for financial reasoning through reinforcement learning. arXiv preprint arXiv:2503.16252, 2025.
[19] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR.
[20] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[22] Chengchun Shi, Shikai Luo, Yuan Le, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a.
Chengchun Shi, Zhengling Qi, Jianing Wang, and Fan Zhou. Value enhancement of reinforcement learning via efficient and ro...
[23] Hu Wang, Congbo Ma, Ian Reid, and Mohammad Yaqub. Kalman filter enhanced GRPO for reinforcement learning-based language model reasoning. arXiv preprint arXiv:2505.07527, 2025a.
Jiayi Wang, Zhengling Qi, and Raymond KW Wong. Projected state-action balancing weights for offline reinforcement learning. The Annals of Statistics, 51(4):1639–1665.
[24] Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025b.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et ...
[25] Xintao Xia, Zhiqiu Xia, Linjun Zhang, and Zhanrui Cai. A statistical framework for alignment with biased AI feedback. arXiv preprint arXiv:2602.08259.
[26] Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to LLM reasoning: From rejection sampling to REINFORCE. arXiv preprint arXiv:2504.11343.
[27] Zhongwen Xu and Zihan Ding. Single-stream policy optimization. arXiv preprint arXiv:2509.13232.
[28] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[29] Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette. Shrinking the variance: Shrinkage baselines for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2511.03710.
[30] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.
[31] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a.
Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-R1: Towards parallel...
[32] Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, and Tianxi Cai. Federated offline reinforcement learning. Journal of the American Statistical Association, 119(548):3152–3163, 2024a.
Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient...
[33] Wenzhuo Zhou, Ruoqing Zhu, and Annie Qu. Estimating optimal infinite horizon dynamic treatment regimes via pT-learning. Journal of the American Statistical Association, 119(545):625–638, 2024b.
Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, and Xiaowu Dai. Align: Aligned delegation with performance guarantees for multi-agent LLM reasoni...

[1] [1]

Deep transferq-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870,

Jinhang Chai, Elynn Chen, and Jianqing Fan. Deep transferq-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870,

work page arXiv

[2] [2]

Privacy-preserving reinforcement learning from human feed- back via decoupled reward modeling.arXiv preprint arXiv:2603.22563,

Young Hyun Cho and Will Wei Sun. Privacy-preserving reinforcement learning from human feed- back via decoupled reward modeling.arXiv preprint arXiv:2603.22563,

work page arXiv

[3] [3]

arXiv preprint arXiv:2504.02546 , year=

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning.arXiv preprint arXiv:2504.02546,

work page arXiv

[4] [4]

Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675,

Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675,

work page arXiv

[5] [5]

Statistical reinforcement learning in the real world: A survey of challenges and future directions.arXiv preprint arXiv:2601.15353,

Asim H Gazi, Yongyi Guo, Daiqi Gao, Ziping Xu, Kelly W Zhang, and Susan A Murphy. Statistical reinforcement learning in the real world: A survey of challenges and future directions.arXiv preprint arXiv:2601.15353,

work page arXiv

[6] [6]

A Review of Causal Decision Making

Lin Ge, Hengrui Cai, Runzhe Wan, Yang Xu, and Rui Song. A review of causal decision making. arXiv preprint arXiv:2502.16156,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Ebpo: Empirical bayes shrinkage for stabilizing group-relative policy optimization.arXiv preprint arXiv:2602.05165,

Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, and Lizhu Zhang. Ebpo: Empirical bayes shrinkage for stabilizing group-relative policy optimization.arXiv preprint arXiv:2602.05165,

work page arXiv

[8] [8]

arXiv preprint arXiv:2505.23585 , year=

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy rl with optimal reward baseline.arXiv preprint arXiv:2505.23585,

work page arXiv

[9] [9]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards

Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, and Yuxin Chen. On the learning dynamics of rlvr at the edge of competence.arXiv preprint arXiv:2602.14872,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Buy 4 reinforce samples, get a baseline for free! InICLR 2019 Workshop on Deep Reinforcement Learning Meets Structured Prediction,

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! InICLR 2019 Workshop on Deep Reinforcement Learning Meets Structured Prediction,

work page 2019

[13] [13]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

18 Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah- man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Low-rank contextual reinforcement learning from heterogeneous human feedback.arXiv preprint arXiv:2412.19436,

Seong Jin Lee, Will Wei Sun, and Yufeng Liu. Low-rank contextual reinforcement learning from heterogeneous human feedback.arXiv preprint arXiv:2412.19436,

work page arXiv

[15] [15]

Repo: Replay-enhanced policy optimization.arXiv preprint arXiv:2506.09340, 2025a

Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, and Chaochao Lu. Repo: Replay-enhanced policy optimization.arXiv preprint arXiv:2506.09340, 2025a. Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction for grpo.arXiv preprint arXiv:2603.13134, 2026b. Yuhan Li, Eugene Han, Yifan Hu, Zhenglin...

[16]

CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342,

Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342,

[17]

Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

Kaizhao Liu, Qi Long, Zhekun Shi, Weijie J Su, and Jiancong Xiao. Statistical impossibility and possibility of aligning LLMs with human preferences: From Condorcet paradox to Nash equilibrium. arXiv preprint arXiv:2503.10990, 2025a. Pangpang Liu, Junwei Lu, and Will Wei Sun. Uncertainty quantification for large language model reward learning under heterog...

[18]

Fin-R1: A large language model for financial reasoning through reinforcement learning. arXiv preprint arXiv:2503.16252, 2025

Weidong Liu, Jiyuan Tu, Xi Chen, and Yichen Zhang. Online estimation and inference for robust policy evaluation in reinforcement learning. The Annals of Statistics, 53(5):2128–2152, 2025c. Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, et al. Fin-R1: A large language model for financial r...

[19]

Asynchronous methods for deep reinforcement learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR,

[20]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[22]

Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a

Chengchun Shi, Shikai Luo, Yuan Le, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a. Chengchun Shi, Zhengling Qi, Jianing Wang, and Fan Zhou. Value enhancement of reinforcement learning via efficient and ro...

[23]

Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Hu Wang, Congbo Ma, Ian Reid, and Mohammad Yaqub. Kalman filter enhanced GRPO for reinforcement learning-based language model reasoning. arXiv preprint arXiv:2505.07527, 2025a. Jiayi Wang, Zhengling Qi, and Raymond KW Wong. Projected state-action balancing weights for offline reinforcement learning. The Annals of Statistics, 51(4):1639–1665,

[24]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025b. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et ...

[25]

A statistical framework for alignment with biased AI feedback. arXiv preprint arXiv:2602.08259,

Xintao Xia, Zhiqiu Xia, Linjun Zhang, and Zhanrui Cai. A statistical framework for alignment with biased AI feedback. arXiv preprint arXiv:2602.08259,

[26]

A minimalist approach to LLM reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343,

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to LLM reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343,

[27]

Single-stream policy optimization. arXiv preprint arXiv:2509.13232,

Zhongwen Xu and Zihan Ding. Single-stream policy optimization. arXiv preprint arXiv:2509.13232,

[28]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[29]

Shrinking the variance: Shrinkage baselines for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2511.03710,

Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette. Shrinking the variance: Shrinkage baselines for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2511.03710,

[30]

Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025

Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673,

[31]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a. Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel...

[32]

arXiv preprint arXiv:2603.01162

Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, and Tianxi Cai. Federated offline reinforcement learning. Journal of the American Statistical Association, 119(548):3152–3163, 2024a. Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient...

[33]

Estimating optimal infinite horizon dynamic treatment regimes via pt-learning. Journal of the American Statistical Association, 119(545):625–638, 2024b

Wenzhuo Zhou, Ruoqing Zhu, and Annie Qu. Estimating optimal infinite horizon dynamic treatment regimes via pt-learning. Journal of the American Statistical Association, 119(545):625–638, 2024b. Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, and Xiaowu Dai. Align: Aligned delegation with performance guarantees for multi-agent llm reasoni...
