arxiv: 2405.07863 · v3 · pith:7LPQM4CBnew · submitted 2024-05-13 · cs.LG · cs.AI · cs.CL · stat.ML

RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

Pith reviewed 2026-05-17 16:43 UTC · model grok-4.3

keywords: RLHF, online reinforcement learning, preference modeling, large language models, open-source datasets, iterative alignment, chatbot evaluation

The pith

Online iterative RLHF using proxy preference models from open-source datasets reaches state-of-the-art results on LLM chatbot benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a reproducible workflow for online iterative RLHF that substitutes real-time human feedback with proxy preference models trained on diverse public datasets. This substitution is introduced because continuous human labeling remains impractical for most open-source groups. The workflow combines supervised fine-tuning with repeated online policy updates guided by the proxy model, and the resulting systems are shown to score at the top of AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, and TruthfulQA. Theoretical discussion of online advantages over offline methods is included, together with practical implementation details and released code. If the proxy model retains enough preference signal, the approach demonstrates that high-quality alignment can proceed without new human data collection.

Core claim

Training a proxy preference model on aggregated open-source preference datasets and using it inside an online iterative RLHF loop, after supervised fine-tuning, produces large language models that attain state-of-the-art performance across standard chatbot and academic benchmarks while relying exclusively on publicly available resources.
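As a rough illustration of the loop described above, the following Python sketch shows one way an online iterative cycle could be organised: sample n candidate responses per prompt from the current policy, rank them with the proxy preference model, keep the best and worst as a preference pair, and update the policy on those pairs. The names `collect_pairs`, `online_iterative_rlhf`, `ToyPolicy`, and `toy_proxy_score` are hypothetical, the default `n=8` follows the "rejection sampling with n=8" passage quoted elsewhere on this page, and the toy update rule is only a stand-in for a DPO-style step. It is not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def collect_pairs(prompts: List[str],
                  sample: Callable[[str], str],
                  score: Callable[[str, str], float],
                  n: int = 8) -> List[Pair]:
    """Sample n responses per prompt and keep the best/worst under the proxy score."""
    pairs = []
    for x in prompts:
        candidates = [sample(x) for _ in range(n)]
        ranked = sorted(candidates, key=lambda a: score(x, a))
        worst, best = ranked[0], ranked[-1]
        # Skip prompts where the proxy cannot separate the candidates.
        if score(x, best) > score(x, worst):
            pairs.append((x, best, worst))
    return pairs


def online_iterative_rlhf(prompts, sample, score, update, iterations=3, n=8):
    """Alternate between on-policy data collection and a policy update."""
    history = []
    for _ in range(iterations):
        pairs = collect_pairs(prompts, sample, score, n)
        update(pairs)  # assumption: any preference-based update, e.g. a DPO step
        history.append(len(pairs))
    return history


class ToyPolicy:
    """Hypothetical one-parameter policy: emits "good" with probability p_good."""

    def __init__(self, p_good: float = 0.3, seed: int = 0):
        self.p_good = p_good
        self.rng = random.Random(seed)

    def sample(self, prompt: str) -> str:
        return "good" if self.rng.random() < self.p_good else "bad"

    def update(self, pairs: List[Pair]) -> None:
        # Toy stand-in for a gradient step: move toward the chosen responses.
        for _, chosen, _ in pairs:
            if chosen == "good":
                self.p_good = min(0.99, self.p_good + 0.05)


def toy_proxy_score(prompt: str, response: str) -> float:
    """Hypothetical proxy preference model: prefers the response "good"."""
    return 1.0 if response == "good" else 0.0


policy = ToyPolicy(p_good=0.3, seed=0)
history = online_iterative_rlhf(["q1", "q2", "q3"], policy.sample,
                                toy_proxy_score, policy.update)
print(history, round(policy.p_good, 2))
```

The point of the sketch is the data flow: each iteration collects fresh pairs from the current policy rather than reusing a fixed offline dataset, which is the distinction the report draws between online and offline RLHF.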

What carries the argument

The proxy preference model trained on open-source preference data, which replaces live human feedback during the online sampling-and-update cycle.
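This page does not say how the proxy preference model is parameterised. A common choice in RLHF work, assumed here purely for illustration, is a Bradley-Terry reward model: the probability that one response is preferred over another is the sigmoid of the difference in their scalar rewards, and training minimises the negative log-likelihood of the labelled pairs. The function names below are hypothetical.

```python
import math
from typing import List, Tuple


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen preferred over rejected) under the assumed Bradley-Terry model."""
    gap = r_chosen - r_rejected
    # Branch on the sign of the gap so that exp() never overflows.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)


def bt_nll(gap: float) -> float:
    """Numerically stable -log(sigmoid(gap)) = log(1 + exp(-gap))."""
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


def bt_loss(reward_pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    return sum(bt_nll(c - r) for c, r in reward_pairs) / len(reward_pairs)
```

Under this assumption, a reward model that scores the chosen response above the rejected one gets a loss below log 2 on that pair, and one that ranks them the wrong way round gets a loss above log 2.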

Load-bearing premise

The proxy preference model built from open-source datasets approximates real human feedback closely enough that online RLHF updates remain beneficial rather than harmful.

What would settle it

Measure whether benchmark scores after the online RLHF stage exceed those of the SFT baseline and of an offline RLHF baseline; a drop in scores or lower human preference ratings for the online version would falsify the central claim.

Original abstract

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical engineering report that ships a complete open-source online RLHF pipeline with proxy preference models and all the code and data, which is useful even though the underlying ideas are mostly assembled from prior work.

The main thing here is that the authors have documented and released a full end-to-end workflow for online iterative RLHF that starts with training a proxy preference model on public datasets and then runs the online loop without needing fresh human labels at every step. They include theoretical notes, implementation details, and step-by-step guidebooks, plus the actual models and curated data on GitHub. That release is the real contribution for anyone trying to run this kind of training themselves.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a technical report detailing a workflow for online iterative RLHF. It constructs proxy preference models from diverse open-source datasets to approximate human feedback (since direct online human feedback is infeasible for open-source settings), discusses theoretical insights and algorithmic principles, provides practical implementation details, and reports that SFT combined with iterative RLHF achieves state-of-the-art results on benchmarks including AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, and TruthfulQA using fully open-source data. The authors release models, curated datasets, and step-by-step code.

Significance. If the empirical claims hold, the work supplies a reproducible open-source recipe for online iterative RLHF, which the literature reports outperforms offline variants, thereby addressing a clear gap in existing open-source RLHF projects. The public release of models, datasets, and comprehensive code guidebooks is a concrete strength that directly supports community reproducibility and follow-on research.

major comments (2)

[Experiments] Experiments section: benchmark scores are reported without error bars, ablation studies on the proxy model, or direct head-to-head comparison of proxy versus real human feedback quality; this leaves the central performance claim and the assumption that online RLHF updates remain beneficial only partially supported.
[Reward Modeling] Reward modeling and proxy construction sections: no quantitative validation or analysis is provided showing how closely the proxy preference model approximates real human preferences, which is load-bearing for the claim that iterative online RLHF produces net gains rather than harmful updates.

minor comments (2)

[Abstract] Abstract: states 'impressive performance' and 'state-of-the-art' without quoting the exact scores or naming the immediate baselines used for comparison.
[Implementation] Implementation details: some hyperparameter choices and training schedules for the online RLHF loop could be presented more explicitly (e.g., via additional tables or pseudocode) to aid exact reproduction.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on our technical report. We address each major comment below, indicating planned revisions where they strengthen the manuscript without misrepresenting our contributions or the inherent constraints of open-source RLHF research.

Referee: [Experiments] Experiments section: benchmark scores are reported without error bars, ablation studies on the proxy model, or direct head-to-head comparison of proxy versus real human feedback quality; this leaves the central performance claim and the assumption that online RLHF updates remain beneficial only partially supported.

Authors: We agree that error bars and additional ablations would increase confidence in the reported results. In the revised manuscript we will add standard deviations computed over multiple random seeds for the primary AlpacaEval-2, Arena-Hard, and MT-Bench scores, and we will include a new ablation subsection that varies the composition of the proxy training datasets. A direct head-to-head comparison against real human feedback at the scale required for iterative RLHF is not feasible within the open-source setting that motivates the work; we will instead expand the discussion to cite existing literature on the reliability of learned preference models and to clarify the practical rationale for the proxy approach. revision: partial

Referee: [Reward Modeling] Reward modeling and proxy construction sections: no quantitative validation or analysis is provided showing how closely the proxy preference model approximates real human preferences, which is load-bearing for the claim that iterative online RLHF produces net gains rather than harmful updates.

Authors: We acknowledge that an explicit quantitative assessment of proxy-human agreement would better support the claim that the iterative updates are beneficial. In the revision we will add a dedicated subsection that reports the proxy model's accuracy and pairwise agreement on held-out portions of human preference data drawn from the same open-source sources used in construction. This analysis will be tied directly to the observed benchmark gains to address the concern about potentially harmful updates. revision: yes
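The two analyses promised in the simulated responses above, seed-level error bars and proxy-human pairwise agreement on held-out preference data, could be computed along the following lines. This is a minimal sketch using only the Python standard library; the function names and the `(prompt, human_chosen, human_rejected)` record layout are assumptions made for illustration, and no scores from the paper are reproduced here.

```python
from statistics import mean, stdev
from typing import Callable, List, Tuple


def mean_and_std(scores_by_seed: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one benchmark score across seeds."""
    if len(scores_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores_by_seed), stdev(scores_by_seed)


def pairwise_agreement(proxy_score: Callable[[str, str], float],
                       held_out: List[Tuple[str, str, str]]) -> float:
    """Fraction of held-out human-labelled pairs where the proxy ranks the
    human-chosen response strictly above the human-rejected one."""
    if not held_out:
        raise ValueError("held_out must contain at least one labelled pair")
    hits = sum(1 for prompt, chosen, rejected in held_out
               if proxy_score(prompt, chosen) > proxy_score(prompt, rejected))
    return hits / len(held_out)
```

Agreement near 0.5 on such a held-out set would mean the proxy carries little preference signal, which is exactly the case the referee's second major comment is worried about.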

standing simulated objections not resolved

A full-scale, real-time head-to-head evaluation of proxy versus live human feedback, which remains impractical for the resource-limited open-source setting that the paper targets.

Circularity Check

0 steps flagged

Empirical workflow with no circular derivation chain

The paper describes a practical recipe for online iterative RLHF that begins with training proxy preference models on external open-source datasets and then applies standard RL algorithms to produce models evaluated on independent benchmarks (AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, TruthfulQA). All performance numbers are obtained from actual training runs rather than any first-principles derivation or prediction that reduces to fitted parameters by construction. No equations, uniqueness theorems, or ansatzes are presented that collapse back to the paper's own inputs; the central claim of SOTA results with fully open-source data is therefore an empirical observation, not a self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central contribution is an engineering workflow rather than new theory; the main unverified premise is that the proxy preference model is a faithful stand-in for human judgments.

free parameters (1)

proxy preference model training hyperparameters
Standard RLHF training choices that are fitted or tuned on the open-source datasets.

pith-pipeline@v0.9.0 · 5593 in / 1004 out tokens · 103545 ms · 2026-05-17T16:43:20.020674+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback... iterative DPO... rejection sampling with n=8

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: J(π) = E[r*(x,a)] − η D_KL(π(·|x) ∥ π_0(·|x))
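The quoted objective is the standard KL-regularized reward maximization target. For a single prompt with a finite set of responses, its maximizer has the well-known Gibbs form π*(a|x) ∝ π_0(a|x) exp(r*(x,a)/η), and the optimal value equals η log Σ_a π_0(a|x) exp(r*(x,a)/η). The Python sketch below checks this numerically on a toy three-response example; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def objective(pi: List[float], pi0: List[float],
              rewards: List[float], eta: float) -> float:
    """J(pi) = E_pi[r] - eta * KL(pi || pi0) for one prompt, finite responses."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi0) if p > 0)
    return expected_reward - eta * kl


def gibbs_policy(pi0: List[float], rewards: List[float], eta: float) -> List[float]:
    """Closed-form maximizer: pi*(a) proportional to pi0(a) * exp(r(a) / eta)."""
    shift = max(r / eta for r in rewards)  # subtract the max for numerical stability
    weights = [q * math.exp(r / eta - shift) for q, r in zip(pi0, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy example (hypothetical numbers): three candidate responses for one prompt.
pi0 = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
eta = 0.5

pi_star = gibbs_policy(pi0, rewards, eta)
print(objective(pi0, pi0, rewards, eta), objective(pi_star, pi0, rewards, eta))
```

A smaller η lets the policy concentrate on high-reward responses, while a larger η keeps it close to the reference policy π_0.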

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
cs.CR 2026-04 unverdicted novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26...
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
cs.CV 2024-06 unverdicted novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
Optimal Transport for LLM Reward Modeling from Noisy Preference
cs.LG 2026-05 unverdicted novelty 6.0

SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
cs.CV 2026-04 unverdicted novelty 6.0

SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
cs.LG 2026-02 conditional novelty 6.0

Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
cs.LG 2026-02 unverdicted novelty 6.0

ESSAM matches PPO and GRPO accuracy (~78%) on GSM8K math tasks but uses 10-18x less GPU memory and shows stronger generalization across datasets.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV 2024-12 unverdicted novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
cs.CL 2024-11 conditional novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity
cs.LG 2026-05 unverdicted novelty 5.0

A user-diversity condition is necessary and sufficient for personalized alignment to achieve O(1) online regret and log(1/epsilon) offline sample complexity.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
cs.IR 2026-04 unverdicted novelty 4.0

Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
cs.AI 2024-10 unverdicted novelty 4.0

Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
