
arxiv: 2504.12501 · v9 · submitted 2025-04-16 · 💻 cs.LG

Reinforcement Learning from Human Feedback

Pith reviewed 2026-05-22 19:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords RLHF · reinforcement learning from human feedback · reward models · instruction tuning · direct alignment · preference optimization · human feedback

The pith

RLHF aligns models by sequencing instruction tuning, reward model training, and optimization through reinforcement learning or direct methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The book walks through the full technical process for using human feedback to improve machine learning systems, beginning with historical roots in economics, philosophy, and optimal control. It supplies the necessary definitions, problem setups, and data practices before breaking the pipeline down into concrete stages. Instruction tuning prepares the base model, a reward model learns from human preferences, and then various algorithms refine outputs to better match those preferences. The account closes by flagging open issues around synthetic data generation and evaluation. A quantitative reader gains a map of how each step fits together to produce deployed aligned systems.
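Rejection sampling, one of the refinement methods the abstract names, can be illustrated with a minimal best-of-N sketch. It assumes the usual formulation (sample several completions per prompt, score them with a reward model, keep the top one); `toy_generate` and `toy_reward` are hypothetical stand-ins for a language model and a trained reward model, not anything taken from the book.

```python
from typing import Callable


def rejection_sample(
    prompt: str,
    generate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Return the highest-reward completion among n candidates."""
    # Draw n candidate completions for the same prompt.
    candidates = [generate(prompt, seed) for seed in range(n)]
    # Keep the candidate the reward function scores highest.
    return max(candidates, key=lambda c: reward(prompt, c))


# Hypothetical stand-ins so the sketch runs without a real model.
def toy_generate(prompt: str, seed: int) -> str:
    return prompt + " answer" + "!" * seed


def toy_reward(prompt: str, completion: str) -> float:
    # Toy preference: completions exactly 9 characters longer than the prompt.
    return -abs(len(completion) - len(prompt) - 9)


best = rejection_sample("Q", toy_generate, toy_reward, n=4)
print(best)
```

In the usual setup the kept completions would then serve as fine-tuning data; that step is omitted here.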

Core claim

RLHF decomposes into a sequence of optimization stages that starts with an instruction-tuned model, moves to training a reward model on collected human preference data, and then applies either rejection sampling, reinforcement learning updates, or direct alignment algorithms to produce the final policy.
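The reward-model stage can be made concrete with the pairwise loss commonly used for preference data. The sketch below assumes a Bradley-Terry model, in which the probability that the chosen completion beats the rejected one is the sigmoid of their reward difference. The page itself does not state the loss, so treat this as an illustration and not as the book's exact objective.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), assuming a Bradley-Terry model."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards carry no preference signal, so the loss is log(2).
print(pairwise_preference_loss(0.0, 0.0))
# Scoring the chosen completion higher lowers the loss.
print(pairwise_preference_loss(2.0, 0.0))
```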

What carries the argument

The staged RLHF pipeline that chains instruction tuning to reward modeling and then to policy optimization methods in order to embed human preferences into model behavior.

If this is right

  • Each stage can be tuned independently to improve overall alignment quality.
  • Direct alignment methods offer a shortcut that avoids training an explicit reward model.
  • Rejection sampling and reinforcement learning both serve as post-reward-model refinement techniques.
  • Evaluation of the final aligned model depends on how well the earlier stages captured human intent.
  • Open questions in synthetic data and evaluation directly affect the reliability of the entire pipeline.
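To illustrate the second bullet, here is a minimal sketch of a direct alignment loss in the style of Direct Preference Optimization (reference [25] below). It assumes the standard DPO form, where the policy's log-probability ratio against a frozen reference model plays the role of an implicit reward. The argument names and the default `beta` are illustrative choices, not values from the book.

```python
import math


def _neg_log_sigmoid(z: float) -> float:
    # Numerically stable -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, an assumption
) -> float:
    """DPO-style loss for one preference pair from sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return _neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# A policy identical to the reference gives log(2).
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
# Raising the chosen completion's log-probability lowers the loss.
print(dpo_loss(-4.0, -6.0, -5.0, -6.0))
```

No separate reward model appears anywhere in the computation, which is the shortcut the bullet describes.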

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Treating the pipeline as modular suggests that targeted improvements to any one stage could raise performance across the board without redesigning the others.
  • The emphasis on understudied areas implies that scaling human feedback might shift toward automated data sources sooner than expected.
  • Connections between the optimization stages and classical control theory could inspire new hybrid algorithms not yet explored in the literature.
  • If the pipeline description holds, then mismatches between research prototypes and deployed systems likely stem from implementation details rather than missing stages.

Load-bearing premise

The stages and algorithms presented accurately capture the core technical workflow used in current RLHF research and practical deployments.

What would settle it

A review of recent production systems or research papers that rely on alignment methods outside the described sequence of instruction tuning, reward modeling, and optimization steps would show the account is incomplete.

Figures

Figures reproduced from arXiv: 2504.12501 by Nathan Lambert.

Figure 1: A rendition of the early, three stage RLHF process with SFT, a reward model, …
Figure 2: Timeline of key developments in RLHF discussed in this chapter, from early work …
Figure 3: The core RLHF loop from Christiano et al. (2017): the reward predictor is trained …
Figure 4: Standard RL loop. Adjacent text (3.1.1 A Simple Example: The Thermostat): To build a basic intuition for what RL does, consider a thermostat trying to keep a room at a target temperature of 70°F. In RL, the agent starts with no knowledge of the task and must discover a good policy through trial and error. The thermostat example has the following components (see fig. 5 for how each maps to the trajectory distribution in eq. …
Figure 5: Each term in the trajectory distribution (eq. 1) mapped to the thermostat RL …
Figure 6: CartPole environment showing state variables ( …
Figure 7: Standard RLHF loop
Figure 8: A rendition of the early, three stage RLHF process with SFT, a reward model, …
Figure 9: A rendition of modern post-training with many rounds.
Figure 10: A summary of the Tülu 3 recipe with target skills and multi-step training recipe.
Figure 11: The reward model in RLHF plays the role of the environment component that …
Figure 12: Training a preference reward model requires pairs of chosen and rejected comple…
Figure 13: At inference time, an outcome reward model outputs per-token correctness …
Figure 14: Training an outcome reward model uses offline labels from a verifier or dataset …
Figure 15: Process reward models provide supervision only at step boundaries (e.g., newline …
Figure 16: Overview of the RLHF training loop. A prompt from the dataset is passed to the …
Figure 17: Basic REINFORCE architecture for language models. The shaped reward …
Figure 18: REINFORCE Leave-One-Out (RLOO) architecture. Multiple completions per …
Figure 19: PPO framework. A learned value function enables Generalized Advantage …
Figure 20: Visualization of the different regions of the PPO objective for a hypothetical …
Figure 21: Value function training uses on-policy rollouts to compute targets. The model …
Figure 22: GRPO architecture. Advantages are normalized relative to the group mean and …
Figure 23: A comparison of the generation-update phases for synchronous or asynchronous …
Figure 24: An example distributed RL system, where two queues are managed to pass data …
Figure 25: RLVR in the form of an RL feedback loop. Instead of a reward model, a …
Figure 26: When DPO first released it sparked a fierce debate in the research community …
Figure 27: Sketch of preference displacement in DPO.
Figure 28: Rejection sampling overview. The actual details on which prompts to use, how to select a reward model, how to sequence rejection sampling, etc. are not well documented in the literature. This chapter provides an overview of the methods and leaves further experimentation to the reader. Adjacent text (9.1.1 1. Generating Completions): To generate a set of multiple candidate completions per prompt, let’s define a set of M pr…
Figure 29: The timeline of the integration of various subfields into the modern version of …
Figure 30: An example of one of the earliest preference data collection interface, from …
Figure 31: Example preference data collection interface from when I was served two comple…
Figure 32: Example preference data collection interface from an early version of the popular …
Figure 33: Example preference data collection interface with up or down arrow from the …
Figure 34: Example user interface of text-to-image models.
Figure 35: Overview of the multi-batch cycle for obtaining human preference data from a …
Figure 36: Traditional knowledge distillation trains a smaller student model to match the …
Figure 37: Tool use interleaves model generation with external execution: the model generates …
Figure 38: Over-optimization of an RL training run vs. downstream evaluations. This is a …
Figure 39: Over-optimization with a train and test RM from Bai et al. 2022. License CC-BY.
Figure 40: Forgetting dynamics for forward KL (SFT) versus reverse KL (RL). The “old” …
Figure 41: Bias toward KL-minimal solutions reduces forgetting. (Left) Among policies that …
Figure 42: Report from Epoch AI showing how major AI evaluations are rapidly saturated …
Figure 43: The persona vector extraction and intervention pipeline. Top: contrastive …
Figure 44: (Left) Vectors corresponding to character archetypes are computed by measuring …
Figure 45: Results from the paper on Direct Nash Optimization (DNO) highlighting their …
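The captions for Figures 18 and 22 name two ways of turning several sampled completions per prompt into advantages. The sketch below shows both under stated assumptions. The leave-one-out baseline follows directly from the RLOO name. For GRPO the caption is truncated after "group mean and", so the division by the group standard deviation follows the commonly published formulation and is an assumption here, as are the population standard deviation and the `eps` constant.

```python
from statistics import mean, pstdev
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Each sample's reward minus the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two completions")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def group_normalized_advantages(
    rewards: List[float], eps: float = 1e-8  # eps is an assumed stabilizer
) -> List[float]:
    """Rewards centered on the group mean and scaled by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


group_rewards = [1.0, 0.0, 0.0]
print(rloo_advantages(group_rewards))
print(group_normalized_advantages(group_rewards))
```

Neither function needs a learned value function, which is what separates these estimators from the PPO setup in Figure 19.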
Original abstract

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript is a book-length educational overview of Reinforcement Learning from Human Feedback (RLHF). It traces origins in recent literature and convergent fields (economics, philosophy, optimal control), introduces definitions, problem formulations, and common mathematical tools, then details the full optimization pipeline from instruction tuning through reward model training, rejection sampling, reinforcement learning, and direct alignment algorithms, before addressing advanced topics in synthetic data and evaluation plus open questions.

Significance. If the descriptions accurately reflect current standard practice, the work could serve as a useful consolidated reference for readers with quantitative backgrounds who need a structured walkthrough of the RLHF pipeline employed in large-scale model deployment. Its value lies in synthesis rather than novel technical claims; no machine-checked proofs, reproducible code, or falsifiable predictions are presented.

minor comments (2)
  1. [Abstract] The abstract states that the core chapters detail 'every optimization stage' and 'all of rejection sampling, reinforcement learning, and direct alignment algorithms.' A more precise scope statement early in the introduction would clarify whether less-common variants (e.g., specific offline RL methods or emerging direct-alignment losses) are omitted for brevity.
  2. [Introduction / Setup] The transition from the origins discussion to the mathematical setup section would benefit from an explicit roadmap paragraph that maps the subsequent chapters to the pipeline stages listed in the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript as a consolidated educational reference on RLHF. We appreciate the recommendation for minor revision and will incorporate any necessary clarifications to ensure accuracy in describing current practices.

Circularity Check

0 steps flagged

No significant circularity; descriptive overview of existing RLHF pipeline with no derivations or predictions.

full rationale

This manuscript is an educational book offering a gentle introduction to RLHF methods rather than a research paper advancing novel technical claims or derivations. It describes the standard pipeline (instruction tuning, reward modeling, rejection sampling, reinforcement learning, and direct alignment algorithms) and traces origins to existing literature and fields like economics and optimal control, without presenting any mathematical predictions, fitted parameters renamed as results, or self-referential definitions. No load-bearing steps reduce by construction to inputs, self-citations, or ansatzes; the content is self-contained as a survey of established techniques. This matches the provided reader's assessment of zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an introductory book on established techniques, the work introduces no new free parameters, axioms, or invented entities beyond standard background from machine learning and optimal control.

pith-pipeline@v0.9.0 · 5656 in / 1080 out tokens · 93074 ms · 2026-05-22T19:22:43.359776+00:00 · methodology


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning Assisted Quantum Simulation of Many-Body Excited States and Real-Time Dynamics

    quant-ph 2026-05 unverdicted novelty 6.0

    The work generalizes RL-CQE to excited states and time evolution via adaptive operator selection and a constant-scaling ansatz, reporting chemical accuracy on chemical systems with compact representations.

  2. UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

    cs.HC 2026-05 unverdicted novelty 6.0

    UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.

  3. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  4. RewardBench 2: Advancing Reward Model Evaluation

    cs.CL 2025-06 unverdicted novelty 6.0

    RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training per...

  5. Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

    cs.CL 2026-05 unverdicted novelty 5.0

    Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.

  6. Beyond Distribution Sharpening: The Importance of Task Rewards

    cs.LG 2026-04 unverdicted novelty 5.0

    Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.

  7. When control meets large language models: From words to dynamics

    eess.SY 2026-02 unverdicted novelty 3.0

    The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 7 Pith papers · 54 internal anchors

  1. [1]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017

  2. [2]

    Learning to summarize with human feedback,

    N. Stiennon et al., “Learning to summarize with human feedback,” Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020

  3. [3]

    Training language models to follow instructions with human feedback,

    L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022

  4. [4]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021

  5. [5]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022

  6. [6]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    N. Lambert et al., “Tulu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024

  7. [7]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    J. Dai et al., “Safe RLHF: Safe reinforcement learning from human feedback,” arXiv preprint arXiv:2310.12773, 2023. Available: https://arxiv.org/abs/2310.12773

  8. [8]

    Understanding the effects of rlhf on llm generalisation and diversity,

    R. Kirk et al., “Understanding the effects of rlhf on llm generalisation and diversity,” in International conference on learning representations (ICLR), 2024

  9. [9]

    Sft memorizes, rl generalizes: A comparative study of foundation model post-training,

    T. Chu et al., “Sft memorizes, rl generalizes: A comparative study of foundation model post-training,” in International conference on machine learning (ICML), 2025

  10. [10]

    A long way to go: Investigating length correlations in rlhf,

    P. Singhal, T. Goyal, J. Xu, and G. Durrett, “A long way to go: Investigating length correlations in rlhf,” arXiv preprint arXiv:2310.03716, 2023

  11. [11]

    Disentangling length from quality in direct preference optimization,

    R. Park, R. Rafailov, S. Ermon, and C. Finn, “Disentangling length from quality in direct preference optimization,” in Findings of the association for computational linguistics: ACL 2024, 2024, pp. 4998–5017

  12. [12]

    Olmoe: Open mixture-of-experts language models,

    N. Muennighoff et al., “Olmoe: Open mixture-of-experts language models,” in International conference on learning representations (ICLR), 2025

  13. [13]

    OLMoE, meet iOS

    Allen Institute for Artificial Intelligence, “OLMoE, meet iOS.” https://allenai.org/blog/olmoe-app, 2025

  14. [14]

    Lima: Less is more for alignment,

    C. Zhou et al., “Lima: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

  16. [17]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    D. Khatri et al., “The art of scaling reinforcement learning compute for llms,” arXiv preprint arXiv:2510.13786, 2025

  17. [18]

    Olmo 3

    T. Olmo et al., “Olmo 3.” 2025. Available: https://arxiv.org/abs/2512.13961

  18. [19]

    Stanford alpaca: An instruction-following LLaMA model,

    R. Taori et al., “Stanford alpaca: An instruction-following LLaMA model,” GitHub repository. https://github.com/tatsu-lab/stanford_alpaca; GitHub, 2023

  19. [20]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    W.-L. Chiang et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.” 2023. Available: https://lmsys.org/blog/2023-03-30-vicuna/

  20. [21]

    Koala: A dialogue model for academic research

    X. Geng et al., “Koala: A dialogue model for academic research.” Blog post, 2023. Accessed: Apr. 03, 2023. [Online]. Available: https://bair.berkeley.edu/blog/2023/04/03/koala/

  21. [22]

    Hello dolly: Democratizing the magic of ChatGPT with open models

    M. Conover et al., “Hello dolly: Democratizing the magic of ChatGPT with open models.” Accessed: Jun. 30, 2023. [Online]. Available: https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html

  22. [23]

    A General Language Assistant as a Laboratory for Alignment

    A. Askell et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021

  23. [24]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

  24. [25]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2023

  25. [26]

    Zephyr: Direct distillation of LM alignment,

    L. Tunstall et al., “Zephyr: Direct distillation of LM alignment,” in First conference on language modeling, 2024. Available: https://openreview.net/forum?id=aKkAwZB6JV

  26. [27]

    Camels in a changing climate: Enhancing lm adaptation with tulu 2,

    H. Ivison et al., “Camels in a changing climate: Enhancing lm adaptation with tulu 2,” arXiv preprint arXiv:2311.10702, 2023

  27. [28]

    Ultrafeedback: Boosting language models with high-quality feedback,

    G. Cui et al., “Ultrafeedback: Boosting language models with high-quality feedback,” 2023

  28. [29]

    The Llama 3 Herd of Models

    A. Grattafiori et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  29. [30]

    Nemotron-4 340B technical report

    B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024

  30. [31]

    A survey of preference-based reinforcement learning methods,

    C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” Journal of Machine Learning Research, vol. 18, no. 136, pp. 1–46, 2017

  31. [32]

    A survey of reinforcement learning from human feedback,

    T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, “A survey of reinforcement learning from human feedback,” Transactions on Machine Learning Research (TMLR), 2025

  32. [33]

    Open problems and fundamental limitations of reinforcement learning from human feedback,

    S. Casper et al., “Open problems and fundamental limitations of reinforcement learning from human feedback,” Transactions on Machine Learning Research (TMLR), 2023

  33. [34]

    Tamer: Training an agent manually via evaluative reinforcement,

    W. B. Knox and P. Stone, “Tamer: Training an agent manually via evaluative reinforcement,” in 2008 7th IEEE international conference on development and learning, IEEE, 2008, pp. 292–297

  34. [35]

    Interactive learning from policy-dependent human feedback,

    J. MacGlashan et al., “Interactive learning from policy-dependent human feedback,” in International conference on machine learning, PMLR, 2017, pp. 2285–2294

  35. [36]

    Reward learning from human preferences and demonstrations in atari,

    B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei, “Reward learning from human preferences and demonstrations in atari,” Advances in neural information processing systems, vol. 31, 2018

  36. [37]

    Deep tamer: Interactive agent shaping in high-dimensional state spaces,

    G. Warnell, N. Waytowich, V. Lawhern, and P. Stone, “Deep tamer: Interactive agent shaping in high-dimensional state spaces,” in Proceedings of the AAAI conference on artificial intelligence, 2018

  37. [38]

    Scalable agent alignment via reward modeling: a research direction

    J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, “Scalable agent alignment via reward modeling: A research direction,” arXiv preprint arXiv:1811.07871, 2018

  38. [39]

    Fine-Tuning Language Models from Human Preferences

    D. M. Ziegler et al., “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019

  39. [40]

    Recursively Summarizing Books with Human Feedback

    J. Wu et al., “Recursively summarizing books with human feedback,” arXiv preprint arXiv:2109.10862, 2021

  40. [41]

    Teaching language models to support answers with verified quotes

    J. Menick et al., “Teaching language models to support answers with verified quotes,” arXiv preprint arXiv:2203.11147, 2022

  41. [42]

    Improving alignment of dialogue agents via targeted human judgements

    A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022

  42. [43]

    Scaling laws for reward model overoptimization,

    L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” in International conference on machine learning, PMLR, 2023, pp. 10835–10866

  43. [44]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    D. Ganguli et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022

  44. [45]

    Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization,

    R. Ramamurthy et al., “Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization,” in International conference on learning representations (ICLR), 2023

  45. [46]

    TrlX: A framework for large scale reinforcement learning from human feedback,

    A. Havrilla et al., “TrlX: A framework for large scale reinforcement learning from human feedback,” in Proceedings of the 2023 conference on empirical methods in natural language processing, Singapore: Association for Computational Linguistics, Dec. 2023, pp. 8578–8595. doi: 10.18653/v1/2023.emnlp-main.530

  46. [47]

    TRL: Transformer reinforcement learning,

    L. von Werra et al., “TRL: Transformer reinforcement learning,” GitHub repository. https://github.com/huggingface/trl; GitHub, 2020

  47. [48]

    ChatGPT: Optimizing language models for dialogue

    OpenAI, “ChatGPT: Optimizing language models for dialogue.” https://openai.com/blog/chatgpt/, 2022

  48. [49]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  49. [50]

    Let’s verify step by step,

    H. Lightman et al., “Let’s verify step by step,” in International conference on learning representations (ICLR), 2024

  50. [51]

    Training language models to self-correct via reinforcement learning,

    A. Kumar et al., “Training language models to self-correct via reinforcement learning,” in International conference on learning representations (ICLR), 2025

  51. [52]

    Beyond human data: Scaling self-training for problem-solving with language models,

    A. Singh et al., “Beyond human data: Scaling self-training for problem-solving with language models,” Transactions on Machine Learning Research (TMLR), 2024

  52. [53]

    Introducing OpenAI o1-preview

    OpenAI, “Introducing OpenAI o1-preview.” Sep. 2024. Available: https://openai.com/index/introducing-openai-o1-preview/

  53. [54]

    Reinforcement learning: An introduction,

    R. S. Sutton, “Reinforcement learning: An introduction,” A Bradford Book, 2018

  54. [55]

    Illustrating reinforcement learning from human feedback (RLHF),

    N. Lambert, L. Castricato, L. von Werra, and A. Havrilla, “Illustrating reinforcement learning from human feedback (RLHF),” Hugging Face Blog, 2022

  55. [56]

    Branch-train-merge: Embarrassingly parallel training of expert language models,

    M. Li et al., “Branch-train-merge: Embarrassingly parallel training of expert language models,” arXiv preprint arXiv:2208.03306, 2022

  56. [57]

    Command a: An enterprise-ready large language model,

    T. Cohere et al., “Command a: An enterprise-ready large language model,” arXiv preprint arXiv:2504.00698, 2025

  57. [58]

    2 OLMo 2 Furious

    T. OLMo et al., “2 OLMo 2 furious,” arXiv preprint arXiv:2501.00656, 2024

  58. [59]

    SmolTulu: Higher learning rate to batch size ratios can lead to better reasoning in SLMs

    S. Alrashed, “SmolTulu: Higher learning rate to batch size ratios can lead to better reasoning in SLMs,” arXiv preprint arXiv:2412.08347, 2024

  59. [60]

    Qwen3 Technical Report

    A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  60. [61]

    MiMo: Unlocking the reasoning potential of language model–from pretraining to posttraining

    B. Xia et al., “MiMo: Unlocking the reasoning potential of language model–from pretraining to posttraining,” arXiv preprint arXiv:2505.07608, 2025

  61. [62]

    Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning

    B. Seed et al., “Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning.” 2025. Available: https://arxiv.org/abs/2504.13914

  62. [63]

    Language models are few-shot learners

    T. Brown et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.

  63. [64]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

  64. [65]

    Finetuned language models are zero-shot learners

    J. Wei et al., “Finetuned language models are zero-shot learners,” in International conference on learning representations, 2022. Available: https://openreview.net/forum?id=gEZrGCozdqR

  65. [66]

    Multitask prompted training enables zero-shot task generalization

    V. Sanh et al., “Multitask prompted training enables zero-shot task generalization,” in International conference on learning representations, 2022. Available: https://openreview.net/forum?id=9Vrb9D0WI4

  66. [67]

    Cross-task generalization via natural language crowdsourcing instructions

    S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task generalization via natural language crowdsourcing instructions,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers), Association for Computational Linguistics, May 2022, pp. 3470–3487. doi: 10.18653/v1/2022.acl-long.244

  67. [68]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training llms to prioritize privileged instructions,” arXiv preprint arXiv:2404.13208, 2024

  68. [69]

    Qlora: Efficient finetuning of quantized llms

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” Advances in neural information processing systems, vol. 36, pp. 10088–10115, 2023

  69. [70]

    No robots

    N. Rajani, L. Tunstall, E. Beeching, N. Lambert, A. M. Rush, and T. Wolf, “No robots,” Hugging Face repository. https://huggingface.co/datasets/HuggingFaceH4/no_robots; Hugging Face, 2023

  70. [71]

    Algorithms for inverse reinforcement learning

    A. Y. Ng, S. Russell, et al., “Algorithms for inverse reinforcement learning,” in Proceedings of the seventeenth international conference on machine learning, in ICML ’00. 2000, pp. 663–670

  71. [72]

    Rank analysis of incomplete block designs: I. The method of paired comparisons

    R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952, Accessed: Feb. 13, 2023. [Online]. Available: http://www.jstor.org/stable/2334029

  72. [73]

    Starling-7b: Improving helpfulness and harmlessness with rlaif

    B. Zhu et al., “Starling-7b: Improving helpfulness and harmlessness with rlaif,” in First conference on language modeling, 2024

  73. [74]

    Learning plackett-luce mixtures from partial preferences

    A. Liu, Z. Zhao, C. Liao, P. Lu, and L. Xia, “Learning plackett-luce mixtures from partial preferences,” in Proceedings of the AAAI conference on artificial intelligence, 2019, pp. 4328–4335

  74. [75]

    Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

    B. Zhu, M. Jordan, and J. Jiao, “Principled reinforcement learning with human feedback from pairwise or k-wise comparisons,” in International conference on machine learning, PMLR, 2023, pp. 43037–43067

  75. [76]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

  76. [77]

    Exploring the limit of outcome reward for learning mathematical reasoning

    C. Lyu et al., “Exploring the limit of outcome reward for learning mathematical reasoning,” arXiv preprint arXiv:2502.06781, 2025

  77. [78]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    L. Zheng et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023

  78. [79]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto, “Length-controlled alpacaeval: A simple way to debias automatic evaluators,” arXiv preprint arXiv:2404.04475, 2024.

  79. [80]

    From crowdsourced data to high-quality benchmarks: Arena-hard and BenchBuilder pipeline

    T. Li et al., “From crowdsourced data to high-quality benchmarks: Arena-hard and BenchBuilder pipeline,” in International conference on machine learning (ICML), 2025

  80. [81]

    WILDBENCH: Benchmarking LLMs with challenging tasks from real users in the wild

    B. Y. Lin et al., “WILDBENCH: Benchmarking LLMs with challenging tasks from real users in the wild,” in International conference on learning representations (ICLR), 2025

Showing first 80 references.