PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration

Arnav Raj

arXiv: 2606.27578 · v1 · pith:ULSATT5Rnew · submitted 2026-06-25 · cs.LG · cs.AI

Pith reviewed 2026-06-29 01:23 UTC · model grok-4.3

keywords RLHF · reward modeling · empirical Bayes · calibration · annotator variability · shrinkage estimation · preference data

The pith

PEBS shrinks per-rater affine calibrators toward the population mean in closed form, cutting held-out RMSE by 8.58 percent on PRISM reward ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward models in RLHF usually pool all annotators and fit one global affine map from raw scores to calibrated values. This erases systematic differences in how individual raters use the scale. PEBS instead fits a separate affine calibrator for each annotator on a held-out slice of their data, then applies Morris-James-Stein shrinkage to pull those per-rater maps toward the overall population mean. The result is a closed-form post-hoc adjustment that leaves the underlying reward model untouched and improves within-annotator prediction accuracy.
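The two steps described above (a per-rater affine fit, then shrinkage toward the population mean) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's estimator: it shrinks intercept and slope separately as scalars, uses the unweighted mean of the per-rater estimates as the shrinkage target, and estimates the between-rater variance with a simple method-of-moments formula truncated at zero. The names `fit_affine`, `shrink`, and `shrink_rater_maps` are hypothetical.

```python
from statistics import fmean, variance


def fit_affine(x, y):
    """OLS fit of y ~ a + b*x. Returns (a, b, var_a, var_b). Needs len(x) >= 3."""
    n = len(x)
    xm, ym = fmean(x), fmean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # residual variance estimate
    return a, b, s2 * (1 / n + xm ** 2 / sxx), s2 / sxx


def shrink(estimates, variances):
    """Scalar empirical-Bayes shrinkage toward the unweighted mean (assumed target)."""
    mu = fmean(estimates)
    # Assumed estimator: between-rater variance by method of moments, truncated at 0.
    tau2 = max(0.0, variance(estimates) - fmean(variances))
    return [
        mu + (tau2 / (tau2 + v)) * (e - mu) if tau2 + v > 0 else mu
        for e, v in zip(estimates, variances)
    ]


def shrink_rater_maps(data):
    """data: {rater_id: (scores, ratings)} -> {rater_id: (a_shrunk, b_shrunk)}."""
    ids = list(data)
    fits = [fit_affine(*data[r]) for r in ids]
    a_s = shrink([f[0] for f in fits], [f[2] for f in fits])
    b_s = shrink([f[1] for f in fits], [f[3] for f in fits])
    return dict(zip(ids, zip(a_s, b_s)))
```

Because each weight lies in [0, 1], every shrunk coefficient falls between the rater's own estimate and the population mean; raters with noisy fits are pulled further toward the mean.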

Core claim

PEBS is a per-rater empirical-Bayes shrinkage estimator that fits affine calibrators on held-out slices of each annotator's ratings and shrinks them toward the population mean without retraining the reward model. On the PRISM dataset it reduces within-user held-out RMSE by 8.58 percent relative to the pooled population-slope baseline; the same procedure yields a 9.66 percent RMSE reduction on PluriHarms harm ratings using a Qwen-2.5 base model.
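For readers who want to reproduce the headline metric, the percentage can be computed as below. This is a sketch that assumes the reduction is expressed relative to the baseline RMSE, which matches the wording of the Figure 8 caption; how the paper aggregates RMSE across users is not stated on this page. The function names are hypothetical.

```python
import math


def rmse(pred, target):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))


def relative_rmse_reduction(rmse_baseline, rmse_method):
    """Percent reduction relative to the baseline; positive means the method is better."""
    return 100.0 * (rmse_baseline - rmse_method) / rmse_baseline
```

With illustrative (not reported) values, a baseline RMSE of 1.0000 and a calibrated RMSE of 0.9142 would correspond to an 8.58 percent reduction.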

What carries the argument

Per-rater empirical-Bayes shrinkage applied to affine calibration maps.

If this is right

Inference-time rating calibration becomes annotator-specific while the base reward model stays fixed.
No additional training or gradient steps are required beyond the original reward-model fit.
The same shrinkage procedure can be applied to other rating tasks that use affine maps, such as harm scoring.
Population-level calibration remains available as the shrinkage target when per-rater data are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to non-affine calibrators if a parametric form is chosen that still admits closed-form shrinkage.
In production RLHF pipelines, the per-rater maps could be updated incrementally as new ratings arrive without refitting the base model.
If annotator identities are unavailable at inference, the population mean serves as a natural fallback.
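The fallback mentioned in the last item could look like the following at inference time. This is a hypothetical sketch: `calibrate`, `rater_maps`, and `population_map` are illustrative names, and each map is assumed to be an (intercept, slope) pair.

```python
def calibrate(raw_score, rater_id, rater_maps, population_map):
    """Apply the rater's shrunk affine map, or the population map if the rater is unknown."""
    a, b = rater_maps.get(rater_id, population_map)
    return a + b * raw_score
```

Passing `None` as `rater_id` takes the same fallback path as an unseen rater, so anonymous traffic needs no special handling.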

Load-bearing premise

A held-out slice of each annotator's ratings exists and is representative enough to fit stable per-rater affine calibrators before shrinkage.

What would settle it

Run the method on a dataset where each annotator has too few ratings for a stable held-out fit; if RMSE improvement disappears or reverses, the per-rater calibration step is not adding value.
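One way to approximate this check on an existing dataset is to cap each annotator's calibration slice at a small size and re-run the pipeline. The helper below is a hypothetical sketch of the capping step only; the cap `k` and the seed are free choices, not values taken from the paper.

```python
import random


def truncate_calibration(calib, k, seed=0):
    """calib: {rater_id: [(score, rating), ...]} -> same mapping, at most k pairs per rater."""
    rng = random.Random(seed)
    return {
        r: (rng.sample(pairs, k) if len(pairs) > k else list(pairs))
        for r, pairs in calib.items()
    }
```

Sweeping `k` downward and recomputing the held-out RMSE gain at each value would show where, if anywhere, the improvement disappears or reverses.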

Figures

Figures reproduced from arXiv: 2606.27578 by Arnav Raj.

**Figure 1.** The Phi-3-medium-14B in-family case falls within ±5 pp of the single-seed anchor (+43.23%, shaded band); the Qwen-2.5 row replicates in-family on a different metric (HelpSteer2 pooled-RMSE, +18.24%). Forest plot of point estimates with 95% row-cluster bootstrap confidence intervals, grouped by base-model family. The three Llama-family-dense bases are shown second as scope characterization: on a coherence h…

**Figure 2.** PEBS reduces RMSE on four single-corpus replications and on a 195,963-observation pooled corpus, all using a single Qwen-2.5 base model. Horizontal forest of within-cluster gain (%) with 95% BCa cluster-bootstrap CIs; circles are single-corpus replications, the diamond is the four-corpus pooled estimate. The dashed reference at zero is the pop-slope baseline. The pooled-multi-corpus row (+7.19% [+6.36, +…

**Figure 3.** Both single-component estimators degrade RMSE on PluriHarms; only the joint estimator yields the gain on both corpora. RMSE reduction (%) vs. pop-slope, 95% BCa CIs, dashed reference at zero. Bars use the cross-corpus evaluation protocol of …

**Figure 5.** The verbosity-only control confirms the reversal is attribute-specific, not architecture-wide. One group per base: blue bars are the untrained-coherence gain (%), vermillion bars the trained-verbosity gain (%), with 95% BCa CIs. On all four bases the untrained coherence head stays positive and within ∼1 pp of the Phi-3 reference, while the trained verbosity head itself turns negative (−84.4 / −44.4 / −43.9…

**Figure 6.** PEBS automatically down-weights sparse annotators: shrinkage is largest for $n_j \le 8$ and fades to zero at high $n_j$ with no threshold to tune. The closed-form weight $\omega_j = \tau^2 / (\tau^2 + V_j)$ governs how much PEBS trusts each rater's own calibrator vs. the population mean. Three illustrative populations (PRISM, PluriHarms, HelpSteer2) are plotted against per-user sample sizes $n_j$ and within-rater noise $V_j \propto 1/n_j$; the sh…

**Figure 7.** Adapter prediction-spread ($\sigma_{\mathrm{pred,coh}}$) for three across-family bases, against the 0.40 collapse threshold. The two inversion bases (Llama-3-8B, Yi-1.5-34B) fall below the threshold; the null base (Mistral-Small-22B) lies above. Lower $\sigma_{\mathrm{pred,coh}}$ indicates tighter clustering of adapter outputs around one or two rating values, consistent with a head-collapsed adapter setting. We treat the signature as observati…

**Figure 8.** Per-user RMSE improvement scatter on PRISM (N=1,394 users), illustrating the minority-rater trade-off of EB shrinkage. Each point is one user; the x-axis is per-user sample size $n_j$ (log scale) and the y-axis is the per-user RMSE improvement (pop-slope minus PEBS-shrunk, as a percentage of pop-slope RMSE). Blue points (1,002 users, 71.9%) are helped by PEBS; vermillion points (392, 28.1%) are hurt. Low-$n_j$ u…
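The closed-form weight quoted in the Figure 6 caption is easy to evaluate directly. The sketch below assumes the proportionality in the caption is an equality with a single noise constant, V_j = sigma2 / n_j; the parameter names `tau2` and `sigma2` and the example values are illustrative, not taken from the paper.

```python
def shrinkage_weight(tau2, sigma2, n_j):
    """omega_j = tau^2 / (tau^2 + V_j), with V_j = sigma2 / n_j (assumed noise model)."""
    v_j = sigma2 / n_j
    return tau2 / (tau2 + v_j)


def shrinkage_amount(tau2, sigma2, n_j):
    """Fraction of the way a rater's estimate is pulled toward the population mean."""
    return 1.0 - shrinkage_weight(tau2, sigma2, n_j)
```

With `tau2 = 0.25` and `sigma2 = 1.0`, a rater with 8 ratings gets weight 2/3 on their own calibrator, and the weight rises toward 1 as `n_j` grows, which is the fade-out of shrinkage the caption describes.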

Original abstract

Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slopes into a single average-rater fit that does not match any individual annotator. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits per-rater affine calibrators on a held-out slice of each annotator's ratings and applies Morris-James-Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by 8.58% over the pooled population-slope baseline. The procedure replicates on PluriHarms harm ratings (Qwen-2.5 base, in-family) with a +9.66% RMSE reduction over the same population-slope baseline. PEBS is a closed-form post-hoc estimator for annotator-specific affine calibration in RLHF reward modeling; it leaves the reward base model unchanged and estimates only the rater-level map used at inference time for new ratings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PEBS applies standard empirical-Bayes shrinkage to per-rater affine maps in RLHF and reports 8-10% RMSE drops, but the abstract leaves per-annotator data volumes and split details unspecified.

This paper introduces PEBS, a closed-form post-hoc estimator that fits per-rater affine calibrators on held-out slices and shrinks them toward the population mean using Morris-James-Stein. It reports an 8.58% RMSE reduction on PRISM within-user held-out data and a 9.66% reduction on PluriHarms, both over a pooled population-slope baseline, without retraining the base reward model.

What stands out is the direct application of established shrinkage to rater-specific affine maps in the RLHF setting. The method stays simple and inference-only, which matches a real need when annotators use different rating scales. Replicating the gain on a second dataset with a different base model adds some credibility.

The main limitation is the evaluation setup. The gains assume each annotator has a held-out slice large enough for a stable intercept and slope before shrinkage. The abstract supplies no per-annotator rating counts, no minimum thresholds, and no description of how the slices were chosen. Without those numbers it is hard to rule out that the improvement is tied to the particular data split rather than a general calibration benefit. No error bars or derivation checks appear in the abstract either.

The work targets RLHF practitioners who already collect per-rater data and want a lightweight way to improve calibration. A reader running reward-model experiments could test the procedure quickly if they have the raw ratings.

I would send this to peer review. The core procedure is clean and the reported numbers are concrete, so referees can verify the missing statistics and confirm whether the gains hold under different splits.

Referee Report

2 major / 0 minor

Summary. The paper proposes PEBS, a closed-form per-rater empirical-Bayes shrinkage estimator for affine calibration of RLHF reward models. It fits per-rater intercept and slope on a held-out slice of each annotator's ratings, shrinks these toward the population mean via Morris-James-Stein, and applies the resulting rater-specific map at inference without retraining the base reward model. On PRISM it reports an 8.58% within-user held-out RMSE reduction over a pooled population-slope baseline; the result replicates on PluriHarms harm ratings (Qwen-2.5) with a 9.66% reduction.

Significance. If the per-rater estimates prove stable, PEBS supplies a lightweight, post-hoc calibration step that directly addresses annotator heterogeneity while leaving the underlying reward model unchanged. The closed-form nature and replication on two distinct datasets (PRISM and PluriHarms) are concrete strengths that would make the method easy to adopt.

major comments (2)

[Abstract] The reported 8.58% and 9.66% RMSE reductions rest on the assumption that a held-out slice exists for each annotator that is large enough to yield stable per-rater affine fits before shrinkage; no per-annotator rating counts, minimum thresholds, or slice-size statistics are supplied, leaving the central empirical claim unverifiable.
[Abstract] Abstract and evaluation description: the RMSE improvements are stated without error bars, standard errors, or any statistical test, so it is impossible to determine whether the observed gains exceed sampling variability in the within-user held-out slices.
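The slice-size statistics requested in the first comment are cheap to produce from a list of rater identifiers. The following is a hypothetical sketch; `rating_count_summary` and the `min_count` inclusion threshold are illustrative, and the paper's actual threshold is not given on this page.

```python
from collections import Counter
from statistics import fmean, median, quantiles


def rating_count_summary(rater_ids, min_count=1):
    """Summarize ratings-per-annotator for raters with at least min_count ratings."""
    counts = sorted(c for c in Counter(rater_ids).values() if c >= min_count)
    q1, _, q3 = quantiles(counts, n=4)  # needs at least two included raters
    return {
        "raters": len(counts),
        "mean": fmean(counts),
        "median": median(counts),
        "min": counts[0],
        "max": counts[-1],
        "q1": q1,
        "q3": q3,
    }
```

Reporting this summary for both the calibration slice and the evaluation slice would let a reader judge how many annotators sit in the low-count regime.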

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on verifiability and statistical reporting. We address each major comment below and will revise the manuscript to incorporate the requested details and analyses.

Referee: [Abstract] The reported 8.58% and 9.66% RMSE reductions rest on the assumption that a held-out slice exists for each annotator that is large enough to yield stable per-rater affine fits before shrinkage; no per-annotator rating counts, minimum thresholds, or slice-size statistics are supplied, leaving the central empirical claim unverifiable.

Authors: We agree that the absence of per-annotator rating statistics limits verifiability of the empirical claims. In the revised manuscript we will add a dedicated subsection (or appendix table) reporting the distribution of ratings per annotator (mean, median, min, max, and quartiles), the minimum rating count threshold used for inclusion, and the exact sizes of the held-out slices employed for fitting the per-rater affine calibrators on both PRISM and PluriHarms. revision: yes

Referee: [Abstract] Abstract and evaluation description: the RMSE improvements are stated without error bars, standard errors, or any statistical test, so it is impossible to determine whether the observed gains exceed sampling variability in the within-user held-out slices.

Authors: We acknowledge that the reported RMSE reductions lack measures of uncertainty or formal significance tests. In revision we will augment the abstract and evaluation sections with standard errors obtained via bootstrap resampling over annotators (or via repeated random held-out splits), and we will report the results of a paired statistical test (e.g., Wilcoxon signed-rank or t-test on per-annotator RMSE differences) to assess whether the observed improvements are statistically distinguishable from zero. revision: yes
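The uncertainty analysis promised in this response can be sketched with the standard library alone. The example below is an assumption-laden illustration: it uses a plain percentile bootstrap over annotators (not the BCa intervals mentioned in the figure captions) and a paired t statistic on per-annotator RMSE differences; `diffs` is a hypothetical list holding baseline RMSE minus calibrated RMSE for each annotator.

```python
import math
import random
from statistics import fmean, stdev


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean difference, resampling annotators."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def paired_t_statistic(diffs):
    """t statistic for H0: mean per-annotator difference is zero (df = n - 1)."""
    return fmean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

If SciPy is available, `scipy.stats.wilcoxon` provides the signed-rank test named in the response and `scipy.stats.bootstrap` offers a BCa method; the stdlib version here only shows the resampling unit, which is the annotator rather than the individual rating.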

Circularity Check

0 steps flagged

No significant circularity; estimator independent of reported metric

The paper defines PEBS as a closed-form post-hoc procedure that fits per-rater affine calibrators on held-out slices then applies standard Morris-James-Stein shrinkage toward the population mean. The reported within-user held-out RMSE reductions are evaluated on data separate from the fitting slices, and the shrinkage formulas are taken from classical empirical-Bayes results rather than tuned to the target RMSE numbers. No self-citations, ansatzes, or uniqueness theorems from the authors are invoked as load-bearing steps, and the derivation does not reduce any claimed prediction to its own inputs by construction.
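The separation between fitting and evaluation data that this rationale relies on can be enforced with a per-rater split. This is a hypothetical sketch: `split_per_rater` and the 50/50 default for `calib_frac` are illustrative, since the page does not state how the paper chose its slices.

```python
import random


def split_per_rater(rows, calib_frac=0.5, seed=0):
    """rows: list of (rater_id, score, rating) tuples.

    Returns (calib, evaluation), partitioning each rater's rows into two
    disjoint slices so no rating is used for both fitting and scoring.
    """
    rng = random.Random(seed)
    by_rater = {}
    for row in rows:
        by_rater.setdefault(row[0], []).append(row)
    calib, evaluation = [], []
    for rater_rows in by_rater.values():
        shuffled = rater_rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * calib_frac)
        calib.extend(shuffled[:cut])
        evaluation.extend(shuffled[cut:])
    return calib, evaluation
```

Splitting within each rater, rather than across the pooled rows, keeps every annotator represented on both sides whenever they have at least two ratings.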

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that rater differences are well captured by per-rater affine maps and that held-out per-rater slices are available; no free parameters or invented entities are introduced beyond standard empirical-Bayes machinery.

axioms (1)

domain assumption: Rater differences in RLHF preferences are adequately modeled by per-rater affine transformations of a shared reward model output.
The method fits and shrinks only affine calibrators; this premise is required for the per-rater maps to be meaningful.
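
The affine premise can be written compactly. The notation below is an assumption made for illustration: $r(x_{ij})$ is the shared reward-model score for item $i$ rated by rater $j$, $y_{ij}$ is the observed rating, and $\theta_j$ stands for either coefficient $a_j$ or $b_j$. Only $\tau^2$, $V_j$, and $\omega_j$ appear on this page (in the Figure 6 caption); the rest is not the paper's confirmed notation.

```latex
y_{ij} = a_j + b_j\, r(x_{ij}) + \varepsilon_{ij},
\qquad
\hat{\theta}_j^{\mathrm{shrunk}} = \bar{\theta} + \omega_j \,\bigl(\hat{\theta}_j - \bar{\theta}\bigr),
\qquad
\omega_j = \frac{\tau^2}{\tau^2 + V_j}
```

Under this reading, a rater whose disagreement with the model is not well described by an offset and a slope gains nothing from the procedure, which is why the assumption is listed as load-bearing.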
