f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Pith reviewed 2026-05-16 06:46 UTC · model grok-4.3
The pith
f-GRPO estimates f-divergences between high- and low-reward response distributions to guide LLM alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
f-Group Relative Policy Optimization (f-GRPO) and f-Hybrid Alignment Loss (f-HAL) estimate f-divergences between the reward-aligned and reward-unaligned distributions induced by above-average and below-average reward responses. Optimizing them yields expected reward improvement after alignment, which extends prior divergence interpretations beyond preference supervision to general LLM alignment, including RLVR.
What carries the argument
f-GRPO, an on-policy objective that estimates f-divergences by thresholding scalar rewards at the batch mean to separate aligned and unaligned response distributions.
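The batch-mean split described above can be illustrated with a minimal sketch. The function name `split_by_batch_mean` and the decision to drop responses whose reward equals the mean are assumptions made for illustration; the source only says "above-average" and "below-average" and does not specify how ties are handled.

```python
from statistics import mean


def split_by_batch_mean(responses, rewards):
    """Partition sampled responses into above-mean and below-mean groups.

    Hypothetical helper: responses whose reward equals the batch mean are
    dropped from both groups, which is an assumption, not the paper's rule.
    """
    if len(responses) != len(rewards):
        raise ValueError("responses and rewards must have the same length")
    threshold = mean(rewards)
    aligned = [y for y, r in zip(responses, rewards) if r > threshold]
    unaligned = [y for y, r in zip(responses, rewards) if r < threshold]
    return threshold, aligned, unaligned


# Binary verifiable rewards for one prompt (made-up values).
responses = ["y1", "y2", "y3", "y4"]
rewards = [1.0, 0.0, 0.0, 1.0]
threshold, aligned, unaligned = split_by_batch_mean(responses, rewards)
```

With these made-up rewards the threshold is 0.5, so `y1` and `y4` land in the aligned group and `y2` and `y3` in the unaligned group. If every reward in the batch is identical, both groups come back empty.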
Load-bearing premise
Scalar rewards can be thresholded at the batch mean to induce well-behaved aligned and unaligned distributions whose f-divergence is a useful alignment objective.
What would settle it
An experiment on a math-reasoning RLVR task where f-GRPO optimization produces no measurable increase in average reward or no detectable f-divergence between the induced high- and low-reward distributions.
Original abstract
Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) & unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between reward-aligned & reward-unaligned distributions induced by above- & below-average reward responses, and prove expected reward improvement after alignment. Empirically, $f$-GRPO improves over GRPO on math-reasoning RLVR tasks, while hybrid $f$-HAL mitigates reward hacking in on-policy safety alignment when verifiable rewards are unavailable and learned reward models must be used.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends divergence-based interpretations of alignment objectives from preference data to scalar-reward settings in LLM alignment, including RLVR. It introduces f-GRPO (on-policy RL objectives) and f-HAL (hybrid on/off-policy loss) that estimate f-divergences between distributions induced by responses with above- versus below-batch-mean rewards, proves expected reward improvement, and reports empirical gains over GRPO on math-reasoning tasks plus reduced reward hacking in safety alignment.
Significance. If the central claims hold, the work supplies a principled recipe for constructing RL alignment losses via f-divergence estimation that applies uniformly to verifiable rewards and learned reward models. The empirical results on reasoning and safety tasks indicate potential practical value for mitigating reward hacking while retaining on-policy optimization.
major comments (2)
- [Abstract and §4] (proof of expected reward improvement): The derivation invokes the standard variational representation of f-divergences (or Pinsker-type bounds) to link divergence estimation to policy improvement, but no regularity conditions on reward boundedness, support overlap, or batch variance are stated. When rewards are sparse or intra-batch variance is low, one induced distribution can have near-zero mass, making the estimator degenerate and the improvement guarantee inapplicable.
- [§3.2] (definition of reward-aligned and reward-unaligned distributions): The split is performed by thresholding, at the batch mean, the same scalar rewards used to form the objective. This construction risks circularity; it is unclear whether the resulting f-divergence remains an independent quantity or collapses to a post-hoc fitted statistic, which would undermine the claim that the objective yields principled improvement.
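For reference, the standard variational representation of an f-divergence that the first major comment mentions can be written as follows. This is the textbook form, given here under the usual assumptions that $f$ is convex with $f(1) = 0$ and that $P$ is absolutely continuous with respect to $Q$; the symbols $P$, $Q$, and $T$ are generic, and the paper's own statement in §4 may differ in notation or conditions.

```latex
D_f(P \,\|\, Q)
  = \mathbb{E}_{y \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(y) \right) \right]
  = \sup_{T} \Big\{ \mathbb{E}_{y \sim P}\big[ T(y) \big]
      - \mathbb{E}_{y \sim Q}\big[ f^{*}\big(T(y)\big) \big] \Big\},
\qquad
f^{*}(u) = \sup_{t} \big\{ u\,t - f(t) \big\}.
```

The absolute-continuity requirement is where the referee's concern about support overlap enters: if one induced distribution has near-zero mass where the other does not, the density ratio $dP/dQ$ is ill-behaved and sample-based estimates of either expectation become unreliable.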
minor comments (1)
- [§5] Empirical sections should report explicit data-exclusion rules, reward-model training details, and variance across random seeds to permit verification that reported gains are not driven by post-hoc choices.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the theoretical foundations. We address each major point below and indicate planned revisions to clarify assumptions and definitions.
Point-by-point responses
- Referee: [Abstract and §4] (proof of expected reward improvement): The derivation invokes the standard variational representation of f-divergences (or Pinsker-type bounds) to link divergence estimation to policy improvement, but no regularity conditions on reward boundedness, support overlap, or batch variance are stated. When rewards are sparse or intra-batch variance is low, one induced distribution can have near-zero mass, making the estimator degenerate and the improvement guarantee inapplicable.
  Authors: We agree that the §4 derivation relies on the variational representation of f-divergences and implicitly requires conditions such as bounded rewards, positive support overlap, and non-degenerate intra-batch variance to ensure the induced distributions are well-defined. In the revision we will explicitly list these regularity conditions at the start of §4, add a remark on the scope of the expected improvement guarantee, and discuss practical safeguards (reward normalization, minimum-variance thresholding, or fallback to GRPO) for degenerate batches. This does not alter the core proof but makes its applicability precise.
  Revision: partial
- Referee: [§3.2] (definition of reward-aligned and reward-unaligned distributions): The split is performed by thresholding, at the batch mean, the same scalar rewards used to form the objective. This construction risks circularity; it is unclear whether the resulting f-divergence remains an independent quantity or collapses to a post-hoc fitted statistic, which would undermine the claim that the objective yields principled improvement.
  Authors: The batch-mean split is by design and directly analogous to how preference pairs define aligned/unaligned distributions in prior work; the scalar reward serves as the sole supervision signal that partitions the responses. The f-divergence is a functional of the two policy-induced distributions and is estimated independently of the loss parameters. We will revise §3.2 to state this analogy explicitly, add a short consistency argument showing the estimator does not collapse to a fitted statistic, and clarify that the improvement guarantee follows from the variational representation rather than from any post-hoc fitting.
  Revision: partial
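The safeguards the authors mention for degenerate batches (minimum-variance thresholding and fallback to GRPO) might look roughly like the sketch below. The function names, the `min_std` and `min_count` thresholds, and the string labels are all hypothetical; only the mean-and-standard-deviation normalization in `grpo_advantages` follows the commonly described group-relative advantage of GRPO, and the rebuttal does not say how the fallback is actually implemented.

```python
from statistics import mean, pstdev


def batch_is_degenerate(rewards, min_std=1e-6, min_count=1):
    """Flag a batch whose above-/below-mean split is unusable.

    Assumed criteria: near-zero reward spread, or fewer than `min_count`
    responses on either side of the batch mean.
    """
    mu = mean(rewards)
    n_above = sum(1 for r in rewards if r > mu)
    n_below = sum(1 for r in rewards if r < mu)
    return pstdev(rewards) < min_std or min(n_above, n_below) < min_count


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within the batch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def choose_objective(rewards, min_std=1e-6, min_count=1):
    """Return a label for which (hypothetical) loss branch to use."""
    if batch_is_degenerate(rewards, min_std, min_count):
        return "grpo_fallback"
    return "f_grpo"


sparse = [0.0] * 7 + [1.0]  # a single rewarded response out of eight
label_sparse = choose_objective(sparse, min_count=2)
label_balanced = choose_objective([1.0, 0.0, 1.0, 0.0])
```

Note that when all rewards are identical the standardized advantages are all zero, so a GRPO fallback contributes no gradient for that batch; the guard mainly prevents the divergence estimator from being evaluated on an empty or near-empty group.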
Circularity Check
The reward-aligned and reward-unaligned distributions are defined by thresholding the same scalar rewards that the f-GRPO objective optimizes, so the claimed divergence estimation and the reward-improvement proof reduce to the modeling choice by construction.
specific steps
- self-definitional [Abstract]
  "We show that these objectives estimate f-divergences between reward-aligned & reward-unaligned distributions induced by above- & below-average reward responses, and prove expected reward improvement after alignment."
  The aligned and unaligned distributions are induced exactly by thresholding the scalar rewards that the RL objective optimizes. The claim that the objective "estimates" the f-divergence between these self-defined distributions is therefore true by construction of the objective and the split; the subsequent proof of expected reward improvement follows from the same definitional relation rather than from an independent property of f-divergences.
full rationale
The paper's central derivation defines the aligned distribution as responses with reward above the batch mean and the unaligned as below it, then states that the proposed objectives estimate the f-divergence between these two distributions and proves expected reward improvement. Because the split and the objective both operate directly on the identical reward values, the 'estimation' result and the improvement guarantee are equivalent to the definitional split rather than an independent first-principles derivation. This matches the self-definitional pattern; no external benchmark or non-circular assumption is invoked to break the loop. The remainder of the paper (hybrid loss, empirical results) does not alter this core reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: f-divergences between reward-induced distributions are estimable from on-policy samples and yield monotonic reward improvement
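To make the quantity in this assumption concrete, here is a small sketch that evaluates $D_f(P \| Q) = \sum_i q_i \, f(p_i / q_i)$ for two discrete distributions under a few standard generators. The dictionary `GENERATORS`, the function `f_divergence`, and the example probabilities are illustrative assumptions; the sketch takes the two distributions as given and does not reproduce the paper's sample-based estimator.

```python
import math

# Standard convex generators f with f(1) = 0.
GENERATORS = {
    "kl": lambda t: t * math.log(t),         # KL(P || Q)
    "reverse_kl": lambda t: -math.log(t),    # KL(Q || P)
    "chi2": lambda t: (t - 1.0) ** 2,        # Pearson chi-squared
    "tv": lambda t: 0.5 * abs(t - 1.0),      # total variation
}


def f_divergence(p, q, name):
    """Compute sum_i q_i * f(p_i / q_i) for strictly positive p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if any(x <= 0.0 for x in p) or any(x <= 0.0 for x in q):
        raise ValueError("this sketch requires strictly positive probabilities")
    f = GENERATORS[name]
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


# Made-up two-outcome distributions standing in for the aligned (p)
# and unaligned (q) response distributions.
p = [0.9, 0.1]
q = [0.5, 0.5]
divergences = {name: f_divergence(p, q, name) for name in GENERATORS}
```

Requiring strictly positive probabilities sidesteps the zero-mass cases raised in the referee report; handling them properly depends on the generator and is left out of this sketch.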
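Algorithm excerpt:
Input: optional supervision dataset $D_{\text{sup}}$, either $D_{\text{sup}} = \{(x, y_w, y_l)\}$ (pairwise) or $D_{\text{sup}} = \{(x, y, \ell)\}$ with $\ell \in \{+1, -1\}$ (binary)
Initialize parameters $\theta$
repeat
    Sample minibatch of prompts $\{x_b\}_{b=1}^{B}$
    Set behavior policy $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$
    Initialize $g_{\text{on}} \leftarrow 0$ and $g_{\text{off}} \leftarrow 0$
    On-policy term ($f$-GRPO):
    for $b = 1$ to $B$ do
        Sample $\{y_{b,i}\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x_b)$
        Compute rewards $r_{b,i} = r(x_b, y_{b,i})$ ...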
(Appendix C.3), we quantify this separation using the Bhattacharyya distance $D_B$ between the clusters, which we adopt as a robustness metric.
Figure 2 panels: (a) Base, $D_B = 2.48$; (b) $\lambda = 0$, $D_B = 4.47$; (c) $\lambda = 0.5$, $D_B = 12.13$; (d) $\lambda = 1$, $D_B = 9.14$.
Figure 2. Latent-space separation (Bhattacharyya distance $D_B$) between safe and harmful prompt clusters before and after alignment with f...
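As a rough illustration of the metric named in the caption, the sketch below computes the Bhattacharyya distance between two one-dimensional samples under the assumption that each cluster is Gaussian. The function name, the 1D projection, and the sample values are hypothetical; how the paper computes $D_B$ on latent representations is described in its Appendix C.3, which is not reproduced here.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_gaussian_1d(xs, ys):
    """Bhattacharyya distance between two 1D samples, each modeled as Gaussian.

    Uses the closed form for univariate normals:
    D_B = (mu1 - mu2)^2 / (4 (var1 + var2))
          + 0.5 * ln((var1 + var2) / (2 * sqrt(var1 * var2))).
    """
    mu1, mu2 = mean(xs), mean(ys)
    var1, var2 = pvariance(xs), pvariance(ys)
    if var1 <= 0.0 or var2 <= 0.0:
        raise ValueError("each sample needs non-zero variance")
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    var_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


# Made-up 1D projections of "safe" and "harmful" prompt representations.
safe = [-1.0, 1.0]
harmful = [3.0, 5.0]
distance = bhattacharyya_gaussian_1d(safe, harmful)
```

A larger value indicates less overlap between the two fitted Gaussians, which matches the caption's reading of $D_B$ as a measure of cluster separation.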