Reinforcement Learning from Human Feedback
Pith reviewed 2026-05-22 19:22 UTC · model grok-4.3
The pith
RLHF aligns models by sequencing instruction tuning, reward model training, and optimization through reinforcement learning or direct methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLHF decomposes into a sequence of optimization stages that starts with an instruction-tuned model, moves to training a reward model on collected human preference data, and then applies either rejection sampling, reinforcement learning updates, or direct alignment algorithms to produce the final policy.
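The reward-modeling stage named in the claim is commonly trained with a pairwise Bradley-Terry objective. The sketch below is a minimal illustration under that assumption, not code from the book. The function names and the scalar inputs are hypothetical stand-ins for a reward model's scores on a human-preferred ("chosen") and a dispreferred ("rejected") completion.

```python
import math


def _neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair (hypothetical helper).

    The loss shrinks as the reward model scores the chosen completion
    further above the rejected one.
    """
    return _neg_log_sigmoid(score_chosen - score_rejected)


if __name__ == "__main__":
    # Equal scores mean the model is indifferent, so the loss is log(2).
    print(pairwise_reward_loss(0.0, 0.0))
    # A correct ranking is penalized less than an inverted one.
    print(pairwise_reward_loss(2.0, 0.0), pairwise_reward_loss(0.0, 2.0))
```

In practice the scores would come from a neural reward model and the loss would be averaged over a batch of preference pairs.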
What carries the argument
The staged RLHF pipeline that chains instruction tuning to reward modeling and then to policy optimization methods in order to embed human preferences into model behavior.
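As a rough illustration of how a trained reward model feeds a later optimization stage, the following sketch shows best-of-N rejection sampling with toy stand-ins. The names `best_of_n`, `toy_sample`, and `toy_reward` are hypothetical. A real pipeline would call a language model and a learned reward model instead of the canned completions and length-based score assumed here.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Draw n completions and keep the one the reward function scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: reward(prompt, completion))


# Toy stand-ins (assumptions for illustration only).
_canned = itertools.cycle(["ok", "a fuller answer", "meh"])


def toy_sample(prompt: str) -> str:
    """Pretend policy: cycles through canned completions."""
    return next(_canned)


def toy_reward(prompt: str, completion: str) -> float:
    """Pretend reward model: prefers longer completions."""
    return float(len(completion))


if __name__ == "__main__":
    print(best_of_n("Explain RLHF.", toy_sample, toy_reward, n=3))
```

The selected completions can then be used as fine-tuning targets, which is the sense in which rejection sampling acts as a refinement step after reward modeling.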
If this is right
- Each stage can be tuned independently to improve overall alignment quality.
- Direct alignment methods offer a shortcut that avoids training an explicit reward model.
- Rejection sampling and reinforcement learning both serve as post-reward-model refinement techniques.
- Evaluation of the final aligned model depends on how well the earlier stages captured human intent.
- Open questions in synthetic data and evaluation directly affect the reliability of the entire pipeline.
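The point about direct alignment methods can be made concrete with the Direct Preference Optimization (DPO) loss, one widely used direct alignment objective. This is a minimal sketch for a single preference pair, assuming the sequence log-probabilities have already been computed. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the book.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    Computes -log(sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))),
    where each log ratio compares the policy to a frozen reference model.
    No separate reward model is evaluated.
    """
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifts probability toward the chosen completion: loss drops.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The log ratios play the role of an implicit reward, which is why this family of methods can skip the explicit reward-model stage.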
Where Pith is reading between the lines
- Treating the pipeline as modular suggests that targeted improvements to any one stage could raise performance across the board without redesigning the others.
- The emphasis on understudied areas implies that scaling human feedback might shift toward automated data sources sooner than expected.
- Connections between the optimization stages and classical control theory could inspire new hybrid algorithms not yet explored in the literature.
- If the pipeline description holds, then mismatches between research prototypes and deployed systems likely stem from implementation details rather than missing stages.
Load-bearing premise
The stages and algorithms presented accurately capture the core technical workflow used in current RLHF research and practical deployments.
What would settle it
Finding recent production systems or research papers that rely on alignment methods outside the described sequence of instruction tuning, reward modeling, and optimization steps would show that the account is incomplete.
Original abstract
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a book-length educational overview of Reinforcement Learning from Human Feedback (RLHF). It traces origins in recent literature and convergent fields (economics, philosophy, optimal control), introduces definitions, problem formulations, and common mathematical tools, then details the full optimization pipeline from instruction tuning through reward model training, rejection sampling, reinforcement learning, and direct alignment algorithms, before addressing advanced topics in synthetic data and evaluation plus open questions.
Significance. If the descriptions accurately reflect current standard practice, the work could serve as a useful consolidated reference for readers with quantitative backgrounds who need a structured walkthrough of the RLHF pipeline employed in large-scale model deployment. Its value lies in synthesis rather than novel technical claims; no machine-checked proofs, reproducible code, or falsifiable predictions are presented.
minor comments (2)
- [Abstract] The abstract states that the core chapters detail 'every optimization stage' and 'all of rejection sampling, reinforcement learning, and direct alignment algorithms.' A more precise scope statement early in the introduction would clarify whether less-common variants (e.g., specific offline RL methods or emerging direct-alignment losses) are omitted for brevity.
- [Introduction / Setup] The transition from the origins discussion to the mathematical setup section would benefit from an explicit roadmap paragraph that maps the subsequent chapters to the pipeline stages listed in the abstract.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript as a consolidated educational reference on RLHF. We appreciate the recommendation for minor revision and will incorporate any necessary clarifications to ensure accuracy in describing current practices.
Circularity Check
No significant circularity; descriptive overview of existing RLHF pipeline with no derivations or predictions.
full rationale
This manuscript is an educational book offering a gentle introduction to RLHF methods rather than a research paper advancing novel technical claims or derivations. It describes the standard pipeline (instruction tuning, reward modeling, rejection sampling, reinforcement learning, and direct alignment algorithms) and traces origins to existing literature and fields like economics and optimal control, without presenting any mathematical predictions, fitted parameters renamed as results, or self-referential definitions. No load-bearing steps reduce by construction to inputs, self-citations, or ansatzes; the content is self-contained as a survey of established techniques. This matches the provided reader's assessment of zero circularity.
Forward citations
Cited by 7 Pith papers
- Reinforcement Learning Assisted Quantum Simulation of Many-Body Excited States and Real-Time Dynamics
  The work generalizes RL-CQE to excited states and time evolution via adaptive operator selection and a constant-scaling ansatz, reporting chemical accuracy on chemical systems with compact representations.
- UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
  UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
- DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
  DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
- RewardBench 2: Advancing Reward Model Evaluation
  RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training per...
- Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
  Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
- Beyond Distribution Sharpening: The Importance of Task Rewards
  Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- When control meets large language models: From words to dynamics
  The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.