pith. machine review for the scientific record.

arxiv: 2504.12501 · v9 · submitted 2025-04-16 · 💻 cs.LG

Recognition: unknown

Reinforcement Learning from Human Feedback

classification 💻 cs.LG
keywords book, learning, reinforcement, rlhf, core, data, feedback, human

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

This paper has not been read by Pith yet.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

    cs.HC · 2026-05 · unverdicted · novelty 6.0

    UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.

  2. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  3. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  4. Beyond Distribution Sharpening: The Importance of Task Rewards

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.