RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergence from initialization.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2022 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergence from initialization.