arxiv: 2509.19104 · v2 · submitted 2025-09-23 · 💻 cs.LG · stat.ML

Online Distributionally Robust LLM Alignment via Regression to Relative Reward

Sharan Sahu , Martin T. Wells This is my paper

Pith reviewed 2026-05-18 14:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords distributionally robust optimizationLLM alignmentRLHFrelative reward regressionpreference shiftonline learningDRO-REBELrobust alignment

0 comments p. Extension

The pith

DRO-REBEL performs distributionally robust online alignment of large language models by reducing each update to a relative-reward regression problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a family of online robust updates called DRO-REBEL that incorporate distributionally robust optimization into REBEL-style alignment for LLMs. It handles different ambiguity sets based on Wasserstein, KL, and chi-squared divergences while preserving scalability through strong duality that converts the robust problem into ordinary relative-reward regression. Under linear rewards, log-linear policies, and a coverage condition, the work derives statistical error bounds that improve on earlier DRO-DPO results and achieve a parametric rate under preference shift that matches standard non-robust RLHF when the shift is mild. The resulting algorithms are simple SGD procedures, one for each divergence type, and they demonstrate better performance than prior robust and non-robust baselines on multiple alignment benchmarks with unseen preference mixtures.

Core claim

Under linear rewards, log-linear policies, and a standard coverage condition, DRO-REBEL achieves O~(sqrt(d/n)) bounds on squared parameter error with sharper constants than prior DRO-DPO analyses, and the first parametric O~(d/n) rate for DRO-based alignment under preference shift, matching non-robust RLHF in benign regimes. Each divergence yields a tractable SGD-based algorithm through gradient regularization, importance weighting, or a one-dimensional dual solve.

What carries the argument

Strong duality that converts each distributionally robust update into a relative-reward regression, allowing online SGD without PPO clipping or value networks for Wasserstein, KL, and chi-squared ambiguity sets.

If this is right

Each divergence produces a distinct, simple SGD algorithm: gradient regularization for Wasserstein, importance weighting for KL, and a one-dimensional dual solve for chi-squared.
The method yields the first parametric rate for DRO-based alignment under preference shift while matching non-robust RLHF rates in benign cases.
DRO-REBEL outperforms prior robust and non-robust baselines on Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment across unseen preference mixtures and varying model sizes.
The approach avoids PPO-style clipping and value networks while retaining REBEL scalability for large-scale online alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If linear reward models approximate real LLM preferences only locally, the regression reduction could still be applied in a piecewise manner to maintain robustness without full retraining.
The coverage condition suggests that data collection strategies focused on broad preference coverage would directly improve the practical convergence rate of these robust updates.
Extending the same duality reduction to non-linear reward models or transformer policies could test whether the scalability benefits persist beyond the linear-log-linear setting.

Load-bearing premise

The analysis assumes rewards are linear in features, policies are log-linear, and the data satisfies a standard coverage condition.

What would settle it

Measuring whether squared parameter error scales as O(d/n) rather than slower when training on synthetic linear-reward preference data that includes controlled distribution shifts would directly test the parametric rate.

Figures

Figures reproduced from arXiv: 2509.19104 by Martin T. Wells, Sharan Sahu.

**Figure 2.** Figure 2: Illustration of the DRO–REBEL update loop. At each iteration, (1) we draw a batch of preference tuples [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Radius–coverage trade-off under χ 2 mixture uncertainty (log–linear policy; Gaussian-mixture simulator). Left: Empirical coverage Pr{pb ∈ Bεn (p ◦ )} for calibrated radii εn = χ 2 K−1,α/n (α ∈ {0.50, 0.90, 0.95}) and a fast baseline εn ∝ n −2 . Calibrated schedules track their nominal coverage across n, whereas n −2 rapidly under-covers. Right: Parameter error ∥ ˆθ − θ ⋆∥2 with n −1/2 and n −1/4 slope guid… view at source ↗

**Figure 4.** Figure 4: At fixed n, sweeping ε = c/n traces a monotone frontier between coverage and excess worst-case risk against χ 2 mixture shifts. Calibrated choices (e.g., χ 2 K−1,0.90/n) sit near a knee of the curve, balancing coverage and estimation error [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

**Figure 5.** Figure 5: Emotion alignment performance under convex (left) and geometric (right) reward mixing. Models are trained [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Performance on five ArmoRM objectives (Correctness, Helpfulness, Honesty, ArmoRM, Coherence), includ [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Emotion alignment performance for W-REBEL under convex (left) and geometric (right) reward mixing. [PITH_FULL_IMAGE:figures/full_fig_p061_7.png] view at source ↗

**Figure 8.** Figure 8: Emotion alignment performance for KL-REBEL under convex (left) and geometric (right) reward mixing. [PITH_FULL_IMAGE:figures/full_fig_p062_8.png] view at source ↗

**Figure 9.** Figure 9: Emotion alignment performance for χ 2 -REBEL under convex (left) and geometric (right) reward mixing. M.1 Radius Coverage Setup Simulator, data-generating process, and model class. We use a controlled Gaussian–mixture environment with K=15 latent groups in ambient dimension d=12. The ground-truth mixture p ◦∈∆K−1 is drawn once from Dir(0.3·1K) where ∆K−1 = ( q ∈ R k : qk ≥ 0, X K k=1 qk = 1) is the standar… view at source ↗

read the original abstract

Reinforcement Learning with Human Feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where language models degrade by overfitting inaccuracies and drifting from preferred behaviors observed during training. Distributionally robust optimization (DRO) is a natural solution, but existing DRO-DPO methods are sample-inefficient, ignore heterogeneous preferences, and lean on brittle heuristics. We introduce \emph{DRO-REBEL}, a family of robust online REBEL updates built on type-$p$ Wasserstein, Kullback-Leibler (KL), and $\chi^2$ ambiguity sets. Strong duality reduces each update to a relative-reward regression, retaining REBEL's scalability without PPO-style clipping or value networks. Under linear rewards, log-linear policies, and a standard coverage condition, we prove $\widetilde{O}(\sqrt{d/n})$ bounds on squared parameter error, with sharper constants than prior DRO-DPO analyses, and give the first parametric $\widetilde{O}(d/n)$ rate for DRO-based alignment under preference shift, matching non-robust RLHF in benign regimes. Each divergence yields a tractable SGD-based algorithm: gradient regularization for Wasserstein, importance weighting for KL, and a 1-D dual solve for $\chi^2$. On Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment, DRO-REBEL outperforms prior robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DRO-REBEL, a family of online distributionally robust REBEL updates for LLM alignment that employ type-p Wasserstein, KL, and χ² ambiguity sets. Strong duality reduces each DRO update to a relative-reward regression problem, yielding scalable SGD-based algorithms (gradient regularization for Wasserstein, importance weighting for KL, and a 1-D dual solve for χ²) without PPO-style clipping or value networks. Under linear rewards, log-linear policies, and a standard coverage condition, the paper proves Õ(√(d/n)) bounds on squared parameter error with sharper constants than prior DRO-DPO work and the first parametric Õ(d/n) rate for DRO-based alignment under preference shift. Experiments on Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment report outperformance over robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.

Significance. If the duality reductions and rate proofs hold, the work supplies a practical, scalable route to distributionally robust online alignment that handles heterogeneous preferences while recovering non-robust rates in benign regimes. The explicit scoping to linear rewards and log-linear policies, together with the reported sharper constants and first parametric rate under shift, would constitute a clear advance over existing DRO-DPO analyses.

major comments (2)

[§4, Theorem 1] §4 (Theoretical Analysis), Theorem 1: the Õ(√(d/n)) squared-parameter-error bound is stated under a coverage condition; the dependence of the leading constant on the coverage parameter (or minimal eigenvalue of the covariance) should be made explicit so that the improvement over prior DRO-DPO constants can be directly compared.
[§5, Table 2] §5 (Experiments), Table 2: the reported gains on ArmoRM are shown for a fixed ambiguity-set radius; an ablation varying the radius across the three divergences would strengthen the claim that the method is robust to hyper-parameter choice.

minor comments (2)

[§3.2] §3.2: the notation for the relative-reward target in the regression formulation should be unified across the three divergence cases to avoid reader confusion.
[Figure 3] Figure 3: axis labels and legend entries for the preference-shift curves are too small; increasing font size would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review, positive summary, and recommendation for minor revision. We address each major comment below.

read point-by-point responses

Referee: [§4, Theorem 1] §4 (Theoretical Analysis), Theorem 1: the Õ(√(d/n)) squared-parameter-error bound is stated under a coverage condition; the dependence of the leading constant on the coverage parameter (or minimal eigenvalue of the covariance) should be made explicit so that the improvement over prior DRO-DPO constants can be directly compared.

Authors: We agree that explicitly displaying the dependence on the coverage parameter (via the minimal eigenvalue of the covariance) will make the claimed improvement in constants over prior DRO-DPO work directly verifiable. In the revised manuscript we will restate Theorem 1 with the leading constant written as C(λ_min) · √(d/n), where λ_min denotes the minimal eigenvalue under the coverage condition, and we will annotate the proof sketch in the appendix to isolate this factor. revision: yes
Referee: [§5, Table 2] §5 (Experiments), Table 2: the reported gains on ArmoRM are shown for a fixed ambiguity-set radius; an ablation varying the radius across the three divergences would strengthen the claim that the method is robust to hyper-parameter choice.

Authors: We concur that varying the radius across the three ambiguity sets would strengthen the robustness claim. We will add a new ablation subsection (or supplementary table) that sweeps the radius for Wasserstein, KL, and χ² on the ArmoRM benchmark while keeping all other experimental settings fixed, and we will report the resulting performance curves. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived from explicit assumptions

full rationale

The paper's central theoretical results are explicitly scoped to linear rewards, log-linear policies, and a standard coverage condition, under which O~(sqrt(d/n)) squared parameter error bounds and O~(d/n) rates are derived. These are standard parametric analyses rather than reductions to fitted quantities or self-citations. The strong duality reduction to relative-reward regression is presented as a mechanism preserving scalability, with no evidence that the claimed rates collapse to inputs by construction. Empirical results are separated from the rate claims. This is a self-contained derivation against the stated assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard RLHF modeling assumptions plus DRO ambiguity sets whose radii are not detailed in the abstract; no new entities are postulated.

free parameters (1)

ambiguity set radius
Size of Wasserstein, KL, or chi-squared balls is a tunable parameter that controls robustness level and must be chosen or fitted for each application.

axioms (2)

domain assumption Linear rewards and log-linear policies
Invoked explicitly for the O~(sqrt(d/n)) and O~(d/n) parameter error bounds.
domain assumption Standard coverage condition
Required to obtain the stated convergence rates on squared parameter error.

pith-pipeline@v0.9.0 · 5802 in / 1341 out tokens · 50808 ms · 2026-05-18T14:42:20.937262+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Strong duality reduces each update to a relative-reward regression... Under linear rewards, log-linear policies... prove O~(sqrt(d/n)) bounds on squared parameter error... first parametric O~(d/n) rate for DRO-based alignment
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_add unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ℓREBEL(z;θ) = (1/η [log πθ(a1|x)/πt(a1|x) - ...] - [r(x,a1)-r(x,a2)])²

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
cs.CL 2026-05 unverdicted novelty 6.0

A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
cs.CL 2026-05 unverdicted novelty 6.0

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

ISBN 1581138385

Association for Computing Machinery. ISBN 1581138385. doi: 10.1145/1015330.1015430. URL https: //doi.org/10.1145/1015330.1015430. Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun.Reinforcement Learning: Theory and Algorithms. 2021a. URLhttps://rltheorybook.github.io/. Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of p...

work page doi:10.1145/1015330.1015430 2000
[2]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L

URLhttps://arxiv.org/abs/2403.01857. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to fol...

work page doi:10.18653/v1/d18-1404 2022
[3]

Regularization via Mass Transportation

URLhttps://arxiv.org/abs/1710.10016. Harvineet Singh Shah, Michael Jung, Kyomin Jung, and Hwanjo Kim. Robust optimization for fairness with noisy protected groups. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5702–5709, 2020. S. Shalev-Shwartz and S. Ben-David.Understanding Machine Learning: From Theory to Algorithms. ...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

For allx, y∈dom(f)andλ∈[0,1], f λx+ (1−λ)y ≤λf(x) + (1−λ)f(y)− σ 2 λ(1−λ)∥x−y∥ 2

work page
[5]

Lemma 9(Beck [2017], Theorem 5.25; Existence and uniqueness of minimizer).Let f:E→(−∞,∞] be proper, closed, andσ-strongly convex withσ >0

For allx∈dom(∂f),y∈dom(f)andg∈∂f(x), f(y)≥f(x) +⟨g, y−x⟩+ σ 2 ∥y−x∥ 2. Lemma 9(Beck [2017], Theorem 5.25; Existence and uniqueness of minimizer).Let f:E→(−∞,∞] be proper, closed, andσ-strongly convex withσ >0. Then: 1.fhas a unique minimizerx ∗

work page 2017
[6]

sup f∈ S∞ k=1 Fk 1 n nX i=1 σif(z i) # . 31 Using the inequalitysup f∈ S∞ k=1 Fk f(z i) = maxk supf∈F k f(z i)≤ P∞ k=1 supf∈F k f(z i), we have Rn ∞[ k=1 Fk ≤ ∞X k=1 Eσ

For allx∈dom(f), f(x)−f(x ∗)≥ σ 2 ∥x−x ∗∥2. 30 A.3 Distributionally Robust Optimization Thef-divergence between the distributionsPandP 0 inXis Df(P∥P 0) = Z X f dP dP0 dP0,(15) wherefis a convex function (e.g.,f(t) =tlogtgives KL divergence). For a lossℓ:X →Rthe following holds. Lemma 10(Duchi and Namkoong [2020], Proposition 1).LetD f be as in(15). Then ...

work page 2020
[7]

The Kullback-Leibler (KL) divergence of P from Q is given by DKL(P∥Q) = 1 2 (µ0 −µ 1)2 σ2 1 + σ2 0 σ2 1 −1−ln σ2 0 σ2 1

be two univariate normal distributions. The Kullback-Leibler (KL) divergence of P from Q is given by DKL(P∥Q) = 1 2 (µ0 −µ 1)2 σ2 1 + σ2 0 σ2 1 −1−ln σ2 0 σ2 1 . Theorem 14(Pinsker’s Inequality, adapted from Cover and Thomas [2006]).Let P and Q be two probability distributions on a measurable space (X,F) . Then the total variation distance between P and Q...

work page 2006
[8]

Slow Rate

For any estimator ˆθn, the minimax risk is bounded below by inf ˆθn sup θ∈{θ0,θ1} Eθ h ∥ˆθn −θ∥ p 2 i ≥ 1 2 ∥θ1 −θ 0∥2 2 p (1−d TV(P n θ0 , P n θ1)), where dTV(P n θ0 , P n θ1) is the total variation distance between the distributions of n i.i.d. observations from Pθ0 and Pθ1. B Proofs of Uniform Boundedness and Lipschitzness ofℓ(z;θ) B.1 Uniform Boundedn...

work page 2025
[9]

three-term

Using the “three-term” decomposition (as in the Wasserstein and KL proof), we get Lχ2 (ˆθχ2 n ;ε)− L χ2 (θχ2 ;ε)≤ Lχ2 (ˆθχ2 n ;ε)− L χ2 n (ˆθχ2 n ;ε) + Lχ2 n (θχ2 ;ε)− L χ2 (θχ2 ;ε) =K ℓ 1 + Kℓ 4λ r 2 log(4/δ) n . Finally, by the strong convexity ofL χ2 (cf. Lemma 9), λ η2 θχ2 − ˆθχ2 n 2 ≤ L χ2 (θχ2 ;ε)− L χ2 (ˆθχ2 n ;ε), Thus with probability at least1−δ...

work page
[10]

[2025], we know that the Wasserstein DPO loss, LW (θ), is γλ-strongly convex with respect to the Euclidean norm ∥·∥2

Verification of Local Strong ConvexityFrom Appendix B.3, Lemma 11 of Xu et al. [2025], we know that the Wasserstein DPO loss, LW (θ), is γλ-strongly convex with respect to the Euclidean norm ∥·∥2. This directly satisfies the first condition with a strong convexity parameter α=γλ where γ= β2e4βB (1+e4βB)2 and λ are from the data coverage assumption

work page 2025
[11]

Fast Rate

Verification of Lipschitz Loss (in θ) and hθ Linear In The Feature MapWe show that the pointwise DPO loss, ℓDPO(z;θ) =−ylogσ(βh θ)−(1−y) logσ(−βh θ), is Lipschitz in θ. The gradient with respect to θ is ∇θℓDPO(z;θ) =∂ℓ DPO/∂hθ · ∇θhθ. First, we bound the norm of the gradient of the preference score. Using the log-linear policy assumption: hθ(s, a1, a2) :=...

work page
[12]

[2025], the KL-DPO loss, LKL(θ), is γλ-strongly convex with respect to the Euclidean norm ∥·∥2

Verification of Local Strong ConvexityFrom Appendix C, Lemma 14 of Xu et al. [2025], the KL-DPO loss, LKL(θ), is γλ-strongly convex with respect to the Euclidean norm ∥·∥2. This directly satisfies the first condition with a strong convexity parameterα=γλwhereγ= β2e4βB (1+e4βB)2 andλis from the data coverage assumption

work page 2025
[13]

[2025], we know that the pointwise DPO loss is uniformly bounded by log(1 +e 4βB)

Verification of Uniform BoundednessFrom Appendix B.2, Lemma 9 of Xu et al. [2025], we know that the pointwise DPO loss is uniformly bounded by log(1 +e 4βB). This directly satisfies the conditions needed for Master Theorem. 56 Thus all four conditions of the Master Theorem have been verified for the KL-DPO problem. We can now substitute the derived consta...

work page 2025
[14]

Define a search interval[L, U], whereU= max i{ℓi}andLis a sufficiently small lower bound

work page
[15]

At each iteration, select a candidateη c = (L+U)/2

work page
[16]

This takesO(n)time as it requires summing over thenloss terms

Compute a subgradientg c ∈∂f(η c). This takesO(n)time as it requires summing over thenloss terms

work page
[17]

Ifg c >0, the minimum must lie to the left, so we setU=η c

work page
[18]

This procedure is repeated until the interval [L, U] is sufficiently small

Ifg c <0, the minimum must lie to the right, so we setL=η c. This procedure is repeated until the interval [L, U] is sufficiently small. The number of iterations required to achieve a desired precision ϵ is O(log((U−L)/ϵ)) . The total complexity of this search is O(nlog(1/ϵ)) . For Algorithm 4, if we assume thatCard ({ℓ i}n i=1) =n, then the runtime will ...

work page
[19]

Stationarity (at eachk): ∇qk L(q, a, λ, ν, γ) =a k − 2λ pk (qk −p k) +ν+γ k = 0

work page
[20]

Primal feasibility: q∈∆ K−1 , KX k=1 (qk−pk)2 pk ≤ρ

work page
[21]

Dual feasibility: λ≥0, γ≥0

work page
[22]

active” contribution equals the “inactive

Complementary Slackness: λ KX k=1 (qk −p k)2 pk −ρ ! = 0, γ kqk = 0∀k∈[K]. We will denote µ= PK k=1 pkak. We will consider two cases: one where the q∗ ∈Int(∆ K−1) and another where q∗ ∈∂∆ K−1. Interior Case.Given q∗ ∈Int(∆ K−1), we haveqk >0∀k∈[K] . Therefore, in order for the second complementary slackness condition to hold, we needγ k = 0∀k∈[K]. Station...

work page 2000
[23]

first stage

was assigned to the pair (a1, a2). This was not a deterministic selection based on the mixed reward, but rather a stochastic process following a Bradley-Terry model. Specifically, a random number was drawn, and if it was less than p= exp(ra1) exp(ra1)+exp(ra2), then a1 was marked as preferred (preference = 1); otherwise, a2 was preferred (preference = 0)....

work page 2020