Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

cs.CL · 2025-08-06 · unverdicted · novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

Reinforcement Learning for LLM Post-Training: A Survey

cs.CL · 2024-07-23 · unverdicted · novelty 3.0

A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.

citing papers explorer

Showing 3 of 3 citing papers.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing cs.CL · 2024-06-12 · unverdicted · none · ref 153
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 43
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
Reinforcement Learning for LLM Post-Training: A Survey cs.CL · 2024-07-23 · unverdicted · none · ref 8
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, DPO variants.

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024

fields

years

verdicts

representative citing papers

citing papers explorer