RLHF Workflow: From Reward Modeling to Online RLHF
Pith reviewed 2026-05-17 16:43 UTC · model grok-4.3
The pith
Online iterative RLHF using proxy preference models from open-source datasets reaches state-of-the-art results on LLM chatbot benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised fine-tuning followed by online iterative RLHF, with a proxy preference model trained on aggregated open-source preference datasets standing in for human feedback, produces large language models that attain state-of-the-art performance on standard chatbot and academic benchmarks while relying exclusively on publicly available resources.
What carries the argument
The proxy preference model trained on open-source preference data, which replaces live human feedback during the online sampling-and-update cycle.
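This page does not say how the proxy is parameterized. As a minimal sketch, assuming it is a scalar reward model fit with a Bradley-Terry pairwise objective (a common choice, used here as an assumption), the preference probability and training loss can be written as follows. The names `bt_preference_prob` and `bt_loss` are hypothetical.

```python
import math
from typing import Iterable, Tuple


def _log1pexp(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected) under Bradley-Terry."""
    return math.exp(-_log1pexp(-(r_chosen - r_rejected)))


def bt_loss(reward_pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    losses = [_log1pexp(-(rc - rr)) for rc, rr in reward_pairs]
    if not losses:
        raise ValueError("need at least one reward pair")
    return sum(losses) / len(losses)


# Rewards here are made-up numbers for illustration only.
print(bt_preference_prob(1.2, 0.3))
print(bt_loss([(1.2, 0.3), (0.1, 0.4)]))
```

The loss falls as the model assigns a larger reward margin to the response the annotator chose, which is the property the online loop relies on when it treats the proxy as a stand-in for human judgment.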
Load-bearing premise
The proxy preference model built from open-source datasets approximates real human feedback closely enough that online RLHF updates remain beneficial rather than harmful.
What would settle it
Measure whether benchmark scores after the online RLHF stage exceed those of the SFT baseline and of an offline RLHF baseline; a drop in scores or lower human preference ratings for the online version would falsify the central claim.
Original abstract
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
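The data-collection step of one online iteration can be sketched as follows: sample several responses per prompt from the current policy, score them with the proxy, and keep a (chosen, rejected) pair. This is a hedged illustration, not the authors' code. Pairing the highest-scored with the lowest-scored response is an assumption, `sample_fn` and `reward_fn` are hypothetical stand-ins for the policy and the proxy preference model, and the policy update itself is not shown. The default `n=8` follows the "rejection sampling with n=8" passage quoted elsewhere on this page.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def collect_preference_pairs(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int = 8,
) -> List[Pair]:
    """Build one preference pair per prompt from n sampled responses."""
    pairs: List[Pair] = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n)]
        scores = [reward_fn(prompt, c) for c in candidates]
        best = max(range(n), key=scores.__getitem__)
        worst = min(range(n), key=scores.__getitem__)
        # Skip prompts where the proxy cannot separate the candidates.
        if scores[best] > scores[worst]:
            pairs.append((prompt, candidates[best], candidates[worst]))
    return pairs


# Toy stand-ins (assumptions): a random-length "policy" and a length-based "proxy".
_rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return prompt + " ok" * _rng.randint(1, 20)


def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))


for prompt, chosen, rejected in collect_preference_pairs(
    ["Explain RLHF.", "What is DPO?"], toy_sample, toy_reward
):
    print(prompt, len(chosen), len(rejected))
```

In a real run the collected pairs would feed a preference-optimization update, after which the updated policy is sampled again in the next iteration.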
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a technical report detailing a workflow for online iterative RLHF. It constructs proxy preference models from diverse open-source datasets to approximate human feedback (since direct online human feedback is infeasible for open-source settings), discusses theoretical insights and algorithmic principles, provides practical implementation details, and reports that SFT combined with iterative RLHF achieves state-of-the-art results on benchmarks including AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, and TruthfulQA using fully open-source data. The authors release models, curated datasets, and step-by-step code.
Significance. If the empirical claims hold, the work supplies a reproducible open-source recipe for online iterative RLHF, which the literature reports outperforms offline variants, thereby addressing a clear gap in existing open-source RLHF projects. The public release of models, datasets, and comprehensive code guidebooks is a concrete strength that directly supports community reproducibility and follow-on research.
major comments (2)
- [Experiments] Experiments section: benchmark scores are reported without error bars, ablation studies on the proxy model, or direct head-to-head comparison of proxy versus real human feedback quality; this leaves the central performance claim and the assumption that online RLHF updates remain beneficial only partially supported.
- [Reward Modeling] Reward modeling and proxy construction sections: no quantitative validation or analysis is provided showing how closely the proxy preference model approximates real human preferences, which is load-bearing for the claim that iterative online RLHF produces net gains rather than harmful updates.
minor comments (2)
- [Abstract] Abstract: states 'impressive performance' and 'state-of-the-art' without quoting the exact scores or naming the immediate baselines used for comparison.
- [Implementation] Implementation details: some hyperparameter choices and training schedules for the online RLHF loop could be presented more explicitly (e.g., via additional tables or pseudocode) to aid exact reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our technical report. We address each major comment below, indicating planned revisions where they strengthen the manuscript without misrepresenting our contributions or the inherent constraints of open-source RLHF research.
Point-by-point responses
- Referee: [Experiments] Experiments section: benchmark scores are reported without error bars, ablation studies on the proxy model, or direct head-to-head comparison of proxy versus real human feedback quality; this leaves the central performance claim and the assumption that online RLHF updates remain beneficial only partially supported.
  Authors: We agree that error bars and additional ablations would increase confidence in the reported results. In the revised manuscript we will add standard deviations computed over multiple random seeds for the primary AlpacaEval-2, Arena-Hard, and MT-Bench scores, and we will include a new ablation subsection that varies the composition of the proxy training datasets. A direct head-to-head comparison against real human feedback at the scale required for iterative RLHF is not feasible within the open-source setting that motivates the work; we will instead expand the discussion to cite existing literature on the reliability of learned preference models and to clarify the practical rationale for the proxy approach.
  Revision: partial
- Referee: [Reward Modeling] Reward modeling and proxy construction sections: no quantitative validation or analysis is provided showing how closely the proxy preference model approximates real human preferences, which is load-bearing for the claim that iterative online RLHF produces net gains rather than harmful updates.
  Authors: We acknowledge that an explicit quantitative assessment of proxy-human agreement would better support the claim that the iterative updates are beneficial. In the revision we will add a dedicated subsection that reports the proxy model's accuracy and pairwise agreement on held-out portions of human preference data drawn from the same open-source sources used in construction. This analysis will be tied directly to the observed benchmark gains to address the concern about potentially harmful updates.
  Revision: yes
- Not addressed by the planned revisions: a full-scale, real-time head-to-head evaluation of proxy versus live human feedback, which remains impractical for the resource-limited open-source setting that the paper targets.
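The pairwise-agreement check promised in the rebuttal can be sketched as follows, assuming held-out data in the form of (prompt, human-chosen, human-rejected) triples and a proxy exposed as a scalar scoring function. The function name, the data layout, and the half-credit rule for ties are illustrative assumptions, not details from the paper.

```python
from typing import Callable, Iterable, Tuple


def pairwise_agreement(
    labelled_pairs: Iterable[Tuple[str, str, str]],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Fraction of human-labelled pairs where the proxy ranks the
    human-chosen response above the rejected one (ties count as 0.5)."""
    total = 0
    agree = 0.0
    for prompt, chosen, rejected in labelled_pairs:
        total += 1
        margin = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
        if margin > 0:
            agree += 1.0
        elif margin == 0:
            agree += 0.5
    if total == 0:
        raise ValueError("need at least one labelled pair")
    return agree / total


# Toy held-out set and a length-based stand-in for the proxy (both made up).
held_out = [
    ("q1", "a detailed answer", "meh"),
    ("q2", "ok", "a longer but worse answer"),
]
print(pairwise_agreement(held_out, lambda p, r: float(len(r))))
```

An agreement rate near 0.5 would mean the proxy is no better than chance on the held-out labels, which is the failure mode the referee's second major comment is concerned with.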
Circularity Check
Empirical workflow with no circular derivation chain
full rationale
The paper describes a practical recipe for online iterative RLHF that begins with training proxy preference models on external open-source datasets and then applies standard RL algorithms to produce models evaluated on independent benchmarks (AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, TruthfulQA). All performance numbers are obtained from actual training runs rather than any first-principles derivation or prediction that reduces to fitted parameters by construction. No equations, uniqueness theorems, or ansatzes are presented that collapse back to the paper's own inputs; the central claim of SOTA results with fully open-source data is therefore an empirical observation, not a self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- proxy preference model training hyperparameters
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback... iterative DPO... rejection sampling with n=8
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, absolute_floor_iff_bare_distinguishability [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: J(π) = E[r*(x,a)] − η D_KL(π(·|x) ∥ π_0(·|x))
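The KL-regularized objective in the passage above has a well-known closed-form maximizer, π*(a|x) ∝ π0(a|x) exp(r*(x,a)/η), with optimal value η log Σ_a π0(a|x) exp(r*(x,a)/η). The sketch below checks this numerically for a single prompt with a small finite response set. The finite set, the specific numbers, and the function names are assumptions made for illustration.

```python
import math
import random
from typing import List


def objective(pi: List[float], pi0: List[float], rewards: List[float], eta: float) -> float:
    """J(pi) = E_pi[r] - eta * KL(pi || pi0) for one prompt and finite actions."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi0) if p > 0)
    return expected_reward - eta * kl


def gibbs_policy(pi0: List[float], rewards: List[float], eta: float) -> List[float]:
    """Closed-form maximizer: pi*(a) proportional to pi0(a) * exp(r(a) / eta)."""
    shift = max(r / eta for r in rewards)  # subtract the max for numerical stability
    weights = [q * math.exp(r / eta - shift) for q, r in zip(pi0, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


pi0 = [0.5, 0.3, 0.2]      # reference policy (made-up numbers)
rewards = [1.0, 2.0, 0.0]  # r*(x, a) for each response (made-up numbers)
eta = 0.5

pi_star = gibbs_policy(pi0, rewards, eta)
j_star = objective(pi_star, pi0, rewards, eta)
closed_form = eta * math.log(sum(q * math.exp(r / eta) for q, r in zip(pi0, rewards)))
print(j_star, closed_form)  # the two values should agree up to rounding

# No randomly drawn policy should beat the Gibbs policy.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in pi0]
    pi = [v / sum(raw) for v in raw]
    assert objective(pi, pi0, rewards, eta) <= j_star + 1e-9
```

Smaller η lets the policy move further from π0 toward high-reward responses, while larger η keeps it close to the reference. This is why errors in the proxy reward matter more when regularization is weak.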
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
  The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
- Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
  Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
- SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
  SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
  PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
  SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
  Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
- ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
  ESSAM matches PPO and GRPO accuracy (~78%) on GSM8K math tasks but uses 10-18x less GPU memory and shows stronger generalization across datasets.
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
- Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity
  A user-diversity condition is necessary and sufficient for personalized alignment to achieve O(1) online regret and log(1/epsilon) offline sample complexity.
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
  Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
  Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.