{"total":10,"items":[{"citing_arxiv_id":"2605.18721","ref_index":5,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"General Preference Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15113","ref_index":7,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning from Language Feedback via Variational Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14539","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards","primary_cat":"cs.CL","submitted_at":"2026-05-14T08:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better 
reasoning and self-correction on math and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11461","ref_index":9,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Breaking $\\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-12T03:20:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Group Relative Policy Optimization.We briefly review GRPO [10], which serves as the optimiza- tion backbone for RLVR in reasoning tasks. For each queryq, the old policy modelπ θold samplesG candidate responses {o1, . . . , oG}. Each response is assigned a verifiable binary reward ri ∈ {0,1} based on matching against the ground-truth answer, which helps mitigate reward hacking [9]. The policyπ θ is updated by maximizing: JGRPO(θ) =E q∼D,{oi}G i=1∼πθold(·|q)   1 G GX i=1 |oi|X t=1 CLIP(ρi,t, Ai)−βKL(π θ||πref)   ,(1) where ρi,t = πθ(oi,t|q,oi,<t) πθold(oi,t|q,oi,<t) is the importance weight. 
The objective employs PPO-style clipping CLIP(ρi,t, Ai) = min(ρi,t Ai, clip(ρi,t, 1−ϵ, 1+ϵ) Ai) for trust-region updates [26], and a KL penalty"},{"citing_arxiv_id":"2605.09808","ref_index":91,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants","primary_cat":"cs.CL","submitted_at":"2026-05-10T23:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[89] Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. [90] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835-10866. PMLR, 2023. [91] Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sushil Sikchi, Joey Hejna, Brad Knox, Chelsea Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms. Advances in Neural Information Processing Systems, 37:126207-126242, 2024. 
[92] Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy,"},{"citing_arxiv_id":"2605.16345","ref_index":2,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Goal-Conditioned Supervised Learning for LLM Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-08T01:55:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GCSL reframes LLM fine-tuning as supervised pursuit of quality thresholds using natural-language goals, outperforming SFT and DPO on toxicity, code, and recommendation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16339","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:48:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05750","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RVPO: Risk-Sensitive Alignment via Variance Regularization","primary_cat":"cs.LG","submitted_at":"2026-05-07T06:43:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM 
alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13602","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07818","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DanceGRPO: Unleashing GRPO on Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Training language models to follow instructions with human feedback. Advancesin neural information processing systems, 35:27730-27744, 2022. [49] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023. [50] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. 
In International Conference on Machine Learning, pages 10835-10866. PMLR, 2023. [51] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning"}],"limit":50,"offset":0}