f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Guang Lin; Lantao Mei; Qifan Song; Rajdeep Haldar; Yue Xing

arxiv: 2602.05946 · v3 · submitted 2026-02-05 · 💻 cs.LG · stat.ML


Pith reviewed 2026-05-16 06:46 UTC · model grok-4.3


keywords: f-GRPO, f-divergence, LLM alignment, reinforcement learning, RLVR, reward hacking, policy optimization, hybrid loss


The pith

f-GRPO estimates f-divergences between high- and low-reward response distributions to guide LLM alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends divergence-based views of preference alignment to reinforcement learning with only scalar rewards. It defines f-GRPO as on-policy objectives that treat responses above and below the batch-mean reward as samples from aligned and unaligned distributions, then estimates their f-divergence. A hybrid variant mixes this with off-policy preference data. The authors prove that following these objectives produces expected reward gains, and show empirical gains on math reasoning plus reduced reward hacking in safety settings.
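As a reminder of what an f-divergence measures, the following minimal sketch computes a few common f-divergences between two small discrete distributions. It is an illustration only: the generator table, the function name `f_divergence`, and the toy distributions are assumptions made here, not objects taken from the paper.

```python
import math

# Generator functions f for a few common f-divergences.
# Each f is convex with f(1) = 0. Names are illustrative, not from the paper.
GENERATORS = {
    "kl": lambda t: t * math.log(t) if t > 0 else 0.0,
    "reverse_kl": lambda t: -math.log(t),          # requires t > 0
    "total_variation": lambda t: 0.5 * abs(t - 1.0),
    "chi_square": lambda t: (t - 1.0) ** 2,
}


def f_divergence(p, q, name):
    """D_f(P || Q) = sum_y q(y) * f(p(y) / q(y)) for discrete P, Q with q(y) > 0."""
    f = GENERATORS[name]
    return sum(qy * f(py / qy) for py, qy in zip(p, q))


# Toy stand-ins for an "aligned" and an "unaligned" response distribution.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
for name in GENERATORS:
    print(name, round(f_divergence(p, q, name), 4))
```

Every generator gives zero when the two distributions coincide and a positive value otherwise, which is the property an alignment objective built on separation between the two sides relies on.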

Core claim

f-Group Relative Policy Optimization (f-GRPO) and f-Hybrid Alignment Loss (f-HAL) estimate f-divergences between reward-aligned and reward-unaligned distributions induced by above-average and below-average reward responses; optimizing them yields expected reward improvement after alignment, extending prior divergence interpretations beyond preference supervision to general LLM alignment including RLVR.

What carries the argument

f-GRPO, an on-policy objective that estimates f-divergences by thresholding scalar rewards at the batch mean to separate aligned and unaligned response distributions.
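The thresholding step described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the split is done per sampled group of responses (whether the paper thresholds per prompt group or per minibatch is not stated on this page), and responses that tie with the mean are assumed to belong to neither side. The group-normalized advantage used by standard GRPO is included only for comparison.

```python
from statistics import mean, pstdev


def split_by_group_mean(rewards):
    """Return indices of above-mean and below-mean responses for one group.

    Assumption: ties with the mean are left out of both sides.
    """
    mu = mean(rewards)
    aligned = [i for i, r in enumerate(rewards) if r > mu]
    unaligned = [i for i, r in enumerate(rewards) if r < mu]
    return aligned, unaligned


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages (population std), for comparison."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]          # e.g. binary verifier outcomes
print(split_by_group_mean(rewards))     # indices of the two sides
print(grpo_advantages(rewards))
```

With binary verifiable rewards the split simply separates correct from incorrect responses; with graded rewards the mean threshold moves from group to group.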

Load-bearing premise

Scalar rewards can be thresholded at the batch mean to induce well-behaved aligned and unaligned distributions whose f-divergence is a useful alignment objective.

What would settle it

An experiment on a math-reasoning RLVR task where f-GRPO optimization produces no measurable increase in average reward or no detectable f-divergence between the induced high- and low-reward distributions.

Figures

Figures reproduced from arXiv: 2602.05946 by Guang Lin, Lantao Mei, Qifan Song, Rajdeep Haldar, Yue Xing.

**Figure 1.** Divergence Estimation Framework. RLVR (left): A verifiable reward signal $r(x, y)$ induces reward-aligned/unaligned distributions (above-average reward $D_r^+$ vs. below-average reward $D_r^-$ under the old policy), and $f$-GRPO performs on-policy alignment by estimating an $f$-divergence between these distributions. Preference alignment (right): preference data samples chosen and rejected prompt-response pairs from aligned ($D^+$)…

**Figure 2.** Visualization for the Qwen-7B base model and its aligned variants. Alignment induces substantially stronger separation between safe and harmful clusters, a phenomenon that has been shown to correlate with increased robustness. Following the procedure of Haldar et al. (2025) (Appendix C.3), we quantify this separation using the Bhattacharyya distance $D_B$ between the clusters, which we adopt…
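To make the separation metric in the Figure 2 caption concrete, here is a minimal sketch of the Bhattacharyya distance between two point clouds. It assumes each cluster is modelled as a Gaussian with diagonal covariance, which is a simplification chosen here to keep the code dependency-free; the procedure actually used by Haldar et al. (2025) is not reproduced on this page, and the function name is hypothetical.

```python
import math
from statistics import mean, pvariance


def bhattacharyya_diag_gaussian(xs, ys):
    """Bhattacharyya distance between diagonal Gaussians fitted to xs and ys.

    xs, ys: lists of equal-length feature vectors (lists of floats).
    Assumption: every dimension has non-zero variance in both clusters.
    """
    dist = 0.0
    for dim_x, dim_y in zip(zip(*xs), zip(*ys)):
        m1, m2 = mean(dim_x), mean(dim_y)
        v1, v2 = pvariance(dim_x), pvariance(dim_y)
        v = 0.5 * (v1 + v2)                              # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v               # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))   # variance-mismatch term
    return dist


safe = [[0.0, 0.0], [2.0, 2.0]]       # toy "safe" cluster
harmful = [[4.0, 4.0], [6.0, 6.0]]    # toy "harmful" cluster
print(bhattacharyya_diag_gaussian(safe, harmful))
```

A larger value means the two clusters overlap less, which is the sense in which the caption treats the distance as a robustness metric.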

Original abstract

Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) & unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between reward-aligned & reward-unaligned distributions induced by above- & below-average reward responses, and prove expected reward improvement after alignment. Empirically, $f$-GRPO improves over GRPO on math-reasoning RLVR tasks, while hybrid $f$-HAL mitigates reward hacking in on-policy safety alignment when verifiable rewards are unavailable and learned reward models must be used.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends divergence estimators to scalar-reward LLM alignment via f-GRPO and f-HAL, but the improvement proof likely needs explicit conditions on reward overlap and scale.


The core move is taking the preference-as-divergence view and lifting it to scalar rewards for RLVR and hybrid cases. They introduce f-GRPO as an on-policy objective that estimates an f-divergence between responses above and below the batch-mean reward, plus f-HAL that mixes in off-policy preferences. The abstract claims this yields expected reward improvement and shows gains over GRPO on math tasks plus reduced reward hacking in safety settings when using learned models. That framing is the actual novelty and it gives a clean recipe for designing new losses. The empirical side looks straightforward and targeted, which is helpful for post-training work. The soft spot is the central assumption that mean-thresholding produces two well-behaved distributions whose divergence reliably drives improvement. If rewards are sparse or batch variance is low, one side can have near-zero mass and the estimator becomes unstable; the abstract does not list regularity conditions or overlap requirements, so the proof sketch may not cover common RLVR regimes. Circularity is also possible since the split uses the same reward values that define the objective. Without the full derivations and reward-model details it is hard to judge how tight the argument is. This paper is for people already working on alignment losses who want a divergence lens on reward-based methods. It is worth a serious referee because the idea is coherent and the experiments are relevant, even if the theory section will need tightening on the conditions for the improvement guarantee.

Referee Report

2 major / 1 minor

Summary. The paper extends divergence-based interpretations of alignment objectives from preference data to scalar-reward settings in LLM alignment, including RLVR. It introduces f-GRPO (on-policy RL objectives) and f-HAL (hybrid on/off-policy loss) that estimate f-divergences between distributions induced by responses with above- versus below-batch-mean rewards, proves expected reward improvement, and reports empirical gains over GRPO on math-reasoning tasks plus reduced reward hacking in safety alignment.

Significance. If the central claims hold, the work supplies a principled recipe for constructing RL alignment losses via f-divergence estimation that applies uniformly to verifiable rewards and learned reward models. The empirical results on reasoning and safety tasks indicate potential practical value for mitigating reward hacking while retaining on-policy optimization.

major comments (2)

[Abstract and §4] (proof of expected reward improvement): the derivation invokes the standard variational representation of f-divergences (or Pinsker-type bounds) to link divergence estimation to policy improvement, but no regularity conditions on reward boundedness, support overlap, or batch variance are stated. When rewards are sparse or intra-batch variance is low, one induced distribution can have near-zero mass, making the estimator degenerate and the improvement guarantee inapplicable.
[§3.2] (definition of reward-aligned and reward-unaligned distributions): the split is performed by thresholding the same scalar rewards used to form the objective at the batch mean. This construction risks circularity; it is unclear whether the resulting f-divergence remains an independent quantity or collapses to a post-hoc fitted statistic, which would undermine the claim that the objective yields principled improvement.
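For reference, the standard definition and variational (Fenchel-dual) representation that the first major comment refers to can be written as below. This is textbook material rather than the paper's own derivation; the critic symbol $T$, the conjugate $f^*$, and the reading of $P$ and $Q$ as the reward-aligned and reward-unaligned distributions are assumptions made here for illustration.

```latex
% f-divergence for convex f with f(1) = 0, assuming P is absolutely continuous w.r.t. Q
D_f(P \,\|\, Q) = \mathbb{E}_{y \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(y) \right) \right]

% Variational representation, with convex conjugate f^*(u) = \sup_{t} \{ u t - f(t) \}
D_f(P \,\|\, Q) = \sup_{T} \left\{ \mathbb{E}_{y \sim P}\big[ T(y) \big] - \mathbb{E}_{y \sim Q}\big[ f^*(T(y)) \big] \right\}
```

For any fixed $T$ the bracketed quantity is a lower bound on $D_f$, and estimating it needs samples from both sides. If one side of the split is empty, the corresponding expectation has no sample estimate, which is the degeneracy the comment describes.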

minor comments (1)

[§5] Empirical sections should report explicit data-exclusion rules, reward-model training details, and variance across random seeds to permit verification that reported gains are not driven by post-hoc choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the theoretical foundations. We address each major point below and indicate planned revisions to clarify assumptions and definitions.


Referee: [Abstract and §4] (proof of expected reward improvement): the derivation invokes the standard variational representation of f-divergences (or Pinsker-type bounds) to link divergence estimation to policy improvement, but no regularity conditions on reward boundedness, support overlap, or batch variance are stated. When rewards are sparse or intra-batch variance is low, one induced distribution can have near-zero mass, making the estimator degenerate and the improvement guarantee inapplicable.

Authors: We agree the §4 derivation relies on the variational representation of f-divergences and implicitly requires conditions such as bounded rewards, positive support overlap, and non-degenerate intra-batch variance to ensure the induced distributions are well-defined. In the revision we will explicitly list these regularity conditions at the start of §4, add a remark on the scope of the expected improvement guarantee, and discuss practical safeguards (reward normalization, minimum-variance thresholding, or fallback to GRPO) for degenerate batches. This does not alter the core proof but makes its applicability precise.
revision: partial

Referee: [§3.2] (definition of reward-aligned and reward-unaligned distributions): the split is performed by thresholding the same scalar rewards used to form the objective at the batch mean. This construction risks circularity; it is unclear whether the resulting f-divergence remains an independent quantity or collapses to a post-hoc fitted statistic, which would undermine the claim that the objective yields principled improvement.

Authors: The batch-mean split is by design and directly analogous to how preference pairs define aligned/unaligned distributions in prior work; the scalar reward serves as the sole supervision signal that partitions the responses. The f-divergence is a functional of the two policy-induced distributions and is estimated independently of the loss parameters. We will revise §3.2 to state this analogy explicitly, add a short consistency argument showing the estimator does not collapse to a fitted statistic, and clarify that the improvement guarantee follows from the variational representation rather than from any post-hoc fitting.
revision: partial
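The safeguards mentioned in the first response (minimum-variance thresholding and a fallback for degenerate batches) could look roughly like the sketch below. Everything in it is hypothetical: the function name, the `min_std` and `min_side` thresholds, and the string labels are illustrative assumptions, since the page does not say how the authors would implement these checks.

```python
from statistics import mean, pstdev


def choose_objective(rewards, min_std=1e-6, min_side=2):
    """Decide whether a group of rewards supports a two-sided split.

    Returns "f-grpo" when both sides of the mean hold at least `min_side`
    responses and the reward spread is non-trivial, else "grpo-fallback".
    """
    mu = mean(rewards)
    n_above = sum(r > mu for r in rewards)
    n_below = sum(r < mu for r in rewards)
    if pstdev(rewards) < min_std or min(n_above, n_below) < min_side:
        return "grpo-fallback"
    return "f-grpo"


print(choose_objective([1.0, 1.0, 1.0, 1.0]))   # no spread at all
print(choose_objective([1.0, 0.0, 0.0, 0.0]))   # only one above-mean sample
print(choose_objective([1.0, 1.0, 0.0, 0.0]))   # balanced split
```

Note that a group with identical rewards carries no learning signal under group-normalized GRPO either, so in that case the fallback amounts to skipping the group.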

Circularity Check

1 step flagged

Reward-aligned/unaligned distributions defined by thresholding the same scalar rewards that the f-GRPO objective optimizes, making the claimed divergence estimation and reward-improvement proof reduce to the modeling choice by construction.

specific steps

self-definitional [Abstract]
"We show that these objectives estimate f-divergences between reward-aligned & reward-unaligned distributions induced by above- & below-average reward responses, and prove expected reward improvement after alignment."

The aligned/unaligned distributions are induced exactly by thresholding the scalar rewards that the RL objective is optimizing. The claim that the objective 'estimates' the f-divergence between these self-defined distributions is therefore true by the construction of the objective and the split; the subsequent 'prove expected reward improvement' follows from the same definitional relation rather than from an independent property of f-divergences.

full rationale

The paper's central derivation defines the aligned distribution as responses with reward above the batch mean and the unaligned as below it, then states that the proposed objectives estimate the f-divergence between these two distributions and proves expected reward improvement. Because the split and the objective both operate directly on the identical reward values, the 'estimation' result and the improvement guarantee are equivalent to the definitional split rather than an independent first-principles derivation. This matches the self-definitional pattern; no external benchmark or non-circular assumption is invoked to break the loop. The remainder of the paper (hybrid loss, empirical results) does not alter this core reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on (1) the existence of a well-defined f-divergence between reward-thresholded distributions and (2) the validity of the on-policy gradient for that divergence; both are extensions of prior divergence-alignment work rather than new axioms.

axioms (1)

domain assumption: f-divergences between reward-induced distributions are estimable from on-policy samples and yield monotonic reward improvement.
Invoked when the paper states that f-GRPO estimates the divergence and proves expected reward improvement.



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 9 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirho...
[2]

Evaluating Large Language Models Trained on Code

Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

[3]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
[4]

Reinforcement learning for reasoning in small llms: What works and what doesn’t

URL https://arxiv.org/abs/2503.16219. Daniel Han, M. H. and team, U. Unsloth,
[5]

KTO: Model Alignment as Prospect Theoretic Optimization

URL http://github.com/unslothai/unsloth. Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306,
[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,
[7]

Llm safety alignment is divergence estimation in disguise

Haldar, R., Wang, Z., Song, Q., Lin, G., and Xing, Y. LLM safety alignment is divergence estimation in disguise. arXiv preprint arXiv:2502.00657,
[8]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., and Kamar, E. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509,

[9]

Verltool: Towards holistic agentic reinforcement learning with tool use

Jiang, D., Lu, Y., Li, Z., Lyu, Z., Nie, P., Wang, H., Su, A., Chen, H., Zou, K., Du, C., et al. Verltool: Towards holistic agentic reinforcement learning with tool use. arXiv preprint arXiv:2509.01055, 2025.
[10]

Salad-bench: A hierarchical and comprehensive safety benchmark for large language models

Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., and Shao, J. Salad-bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044,
[11]

Towards understanding jailbreak attacks in LLMs: A representation space analysis

Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., and Tang, J. Towards understanding jailbreak attacks in LLMs: A representation space analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7067–7085. Association for Computational Linguistics, November 2024a. doi: 10.18653/v1/2024.emnlp-main
[12]

Towards understanding jailbreak attacks in LLMs: A representation space analysis (arXiv version)

URL https://aclanthology.org/2024.emnlp-main.401/. Lin, Y., He, P., Xu, H., Xing, Y., Yamada, M., Liu, H., and Tang, J. Towards understanding jailbreak attacks in LLMs: A representation space analysis. arXiv preprint arXiv:2406.10794, 2024b. Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generating stealthy jailbreak prompts on aligned large languag...
[13]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
[14]

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei

URL https://arxiv.org/abs/2310.16049. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. Learning to summarize from human feedback. In NeurIPS,
[15]

What is the alignment objective of GRPO?

URL https://qwenlm.github.io/blog/qwen2.5/. Vojnovic, M. and Yun, S.-Y. What is the alignment objective of GRPO? arXiv preprint arXiv:2502.18548,
[16]

URL https://arxiv.org/abs/2406.01574. Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Me...

[17]

On prompt-driven safeguarding for large language models

Zheng, C., Yin, F., Zhou, H., Meng, F., Zhou, J., Chang, K.-W., Huang, M., and Peng, N. On prompt-driven safeguarding for large language models. arXiv preprint arXiv:2401.18018,
[18]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911,
[19]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043,
[20]

(4)) Computeˆw± b,i Eq

Input: optional supervision dataset D_sup, either D_sup = {(x, y_w, y_l)} (pairwise) or D_sup = {(x, y, ℓ)} with ℓ ∈ {+1, −1} (binary)
Initialize parameters θ
repeat
    Sample minibatch of prompts {x_b}_{b=1}^{B}
    Set behavior policy π_θold ← π_θ
    Initialize g_on ← 0 and g_off ← 0
    On-policy term (f-GRPO):
    for b = 1 to B do
        Sample {y_{b,i}}_{i=1}^{G} ∼ π_θold(· | x_b)
        Compute rewards r_{b,i} = r(x_b, y_{b,i})...
[21]

What color is the sky?

(Appendix C.3), we quantify this separation using the Bhattacharyya distance D_B between the clusters, which we adopt as a robustness metric. Panels: (a) Base, D_B = 2.48; (b) λ = 0, D_B = 4.47; (c) λ = 0.5, D_B = 12.13; (d) λ = 1, D_B = 9.14. Figure 2. Latent-space separation (Bhattacharyya distance D_B) between safe and harmful prompt clusters before and after alignment with f...
