arXiv preprint arXiv:2409.15360 , year=

Reward-robust rlhf in llms , author= · 2024 · arXiv 2409.15360

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

cs.AI · 2026-05-01 · unverdicted · novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

cs.LG · 2026-04-30 · unverdicted · novelty 6.0

DRRO for RLHF minimizes worst-case regret relative to the best policy under Wasserstein reward perturbations, yielding an exact inner solution and water-filling policy structure for the promptwise simplex model plus a practical policy-gradient algorithm.

Learning to Control Summaries with Score Ranking

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

citing papers explorer

Showing 4 of 4 citing papers.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs cs.AI · 2026-05-01 · unverdicted · none · ref 47
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback cs.LG · 2026-04-30 · unverdicted · none · ref 20
DRRO for RLHF minimizes worst-case regret relative to the best policy under Wasserstein reward perturbations, yielding an exact inner solution and water-filling policy structure for the promptwise simplex model plus a practical policy-gradient algorithm.
Learning to Control Summaries with Score Ranking cs.CL · 2026-04-19 · unverdicted · none · ref 46
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 64
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

arXiv preprint arXiv:2409.15360 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer