Bayesian Preference Learning for Test-Time Steerable Reward Models

Jiwoo Hong; Shao Tang; Zhipeng Wang

arxiv: 2602.08819 · v2 · pith:UDVBFP75new · submitted 2026-02-09 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 13:15 UTC · model grok-4.3

keywords Bayesian reward modeling · test-time adaptation · in-context learning · preference learning · variational inference · Bradley-Terry model · reinforcement learning alignment · multi-objective optimization

The pith

Bayesian method makes reward models adapt to new preferences at test time using example demonstrations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reward models can be made steerable at test time by casting preference modeling as amortized variational inference over a latent probability in the Bradley-Terry model with a conjugate Beta prior. In-context demonstrations allow the model to update its preference estimates for both single and multi-objective cases without retraining. A sympathetic reader would care because conventional reward models stay fixed after training and therefore cannot handle shifting or conflicting human values in reinforcement learning alignment. The approach reports concrete gains such as higher benchmark accuracy and expanded Pareto frontiers when preferences conflict.

Core claim

ICRM performs amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior, enabling adaptation to unseen preference distributions at test time through in-context demonstrations while admitting a global interior optimum with finite confidence.

What carries the argument

Variational In-Context Reward Modeling (ICRM) that infers latent preferences from demonstrations via amortized variational inference under the Bradley-Terry model with a Beta prior.
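For intuition about the conjugate Beta prior, the closed-form Beta-Bernoulli update can be sketched in a few lines. This is a minimal illustration, assuming each in-context demonstration is treated as one Bernoulli observation of the latent preference probability. The function name `beta_posterior` and the toy data are hypothetical and not taken from the paper, which amortizes the inference with a trained model rather than counting outcomes.

```python
def beta_posterior(alpha0, beta0, outcomes):
    """Closed-form Beta-Bernoulli update (illustrative, not the paper's model).

    outcomes: iterable of 0/1 flags, where 1 means the demonstration agrees
    with the preference being scored (an assumption made for this sketch).
    Returns (mu, tau): posterior mean and concentration, so the posterior
    is Beta(mu * tau, (1 - mu) * tau).
    """
    outcomes = list(outcomes)
    wins = sum(outcomes)
    losses = len(outcomes) - wins
    alpha = alpha0 + wins      # pseudo-count for "preferred"
    beta = beta0 + losses      # pseudo-count for "not preferred"
    tau = alpha + beta         # concentration (total pseudo-count)
    mu = alpha / tau           # posterior mean
    return mu, tau


if __name__ == "__main__":
    # Uniform prior Beta(1, 1), then three agreeing and one disagreeing demo.
    mu, tau = beta_posterior(1.0, 1.0, [1, 1, 1, 0])
    print(f"posterior mean mu={mu:.3f}, concentration tau={tau:.1f}")
```

In this toy update the concentration τ grows by exactly one per demonstration; the amortized model is under no such constraint.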

If this is right

RM-Bench accuracy rises from 60.5 to 70.8 as more demonstrations are supplied at test time.
Calibration error on moral dilemma preferences falls below that of a generative judge.
The attainable Pareto frontier expands under conflicting preferences.
The method outperforms a conventional reward model on math reasoning tasks during RL training.
The variational objective admits a global interior optimum with finite confidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same test-time mechanism could support per-user personalization by supplying different demonstration sets to different users.
It could reduce the frequency of full reward model retraining when preference distributions drift in deployed systems.
Combining the Beta prior with richer variational families might capture multimodal or context-dependent preferences more faithfully.
The approach may extend naturally to online preference collection loops during reinforcement learning.

Load-bearing premise

A small number of in-context preference demonstrations combined with a conjugate Beta prior and variational inference suffice to capture and steer complex preference distributions at test time without large approximation errors.

What would settle it

If increasing the number of in-context demonstrations produces no further gains in RM-Bench accuracy or fails to lower calibration error below that of a generative judge, the central claim would be undermined.

Figures

Figures reproduced from arXiv: 2602.08819 by Jiwoo Hong, Shao Tang, Zhipeng Wang.

**Figure 1.** Variational in-context reward modeling (ICRM) with Beta prior for the Bradley-Terry (BT) model. ICRM directly models the mean and sharpness of the Beta posterior, calibrated to how “confident” the model is for the preference triplet $(x, y_w, y_l)$ given in-context preference demonstrations. This yields multi-objective test-time steerability of the reward model for any preferences or tasks.

**Figure 2.** Ablation study. The learning curve of the preference mean µ and the concentration factor τ of the parameterized Beta posterior in the variational in-context reward modeling. Weaker KL regularization, i.e., smaller λ, leads to stronger adaptation to the training data.

**Figure 3.** Trend of the confidence factor τ as the number of in-context preference demonstrations increases for Qwen3-4B-Base ICRM. τ values were collected from the SafeRLHF evaluation results. Multi-objective test-time steerability: We study ICRM’s capacity to balance conflicting objectives via test-time steering. We select the Safety-Should-Respond and Safety-Should-Refuse subsets of the “Safety” domain of RM-Bench, comprising…

**Figure 4.** Multi-objective steerability analysis. Pareto frontiers of ICRM trained on Llama-3.2-3B-Base (Figure 4a) and Qwen3-4B-Base (Figure 4b), and the Hypervolume (HV) of the Pareto frontiers plotted against the number of in-context demonstrations N (Figure 4c). …increases monotonically with increasing N, from less than 0.95 to over 0.98. This indicates that additional demonstrations consistently expand the achieva…
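The hypervolume (HV) of a Pareto frontier, as plotted in Figure 4c, can be computed for two objectives with a short sweep. The sketch below is a generic two-objective version, assuming both objectives are maximized and measured against a reference point at the origin. The function names, the reference point, and the sample scores are illustrative assumptions, not values or code from the paper.

```python
def pareto_front(points):
    """Return the non-dominated points (both objectives are maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by `front` and bounded below by `ref`.

    Assumes every point is at least as large as `ref` in both objectives.
    """
    # Sweep from the largest first objective to the smallest; each point adds
    # the horizontal strip between its second objective and the previous one.
    pts = sorted(front, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv


if __name__ == "__main__":
    # Hypothetical (objective A accuracy, objective B accuracy) pairs.
    scores = [(1.0, 0.5), (0.5, 1.0), (0.4, 0.4)]
    front = pareto_front(scores)
    print(front, hypervolume_2d(front))
```

A frontier that moves outward in either objective yields a larger HV, which is why HV is a convenient single-number summary of frontier expansion.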

**Figure 5.** Parameterizing verifiable rewards. Accuracy mean (“Accuracy (%)”) and average rewards (“Training Reward”) of eight sample responses per query.

**Figure 6.** Evaluation accuracy of intermediate checkpoints in RL training. The evaluation was done on the MATH500 dataset. As an extension of Section 7, we report the step-level evaluation results of the policies trained with RL with different reward models.

Original abstract

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ICRM adds test-time steerability to reward models via amortized variational inference over a Beta-Bradley-Terry latent, with reported benchmark gains but open questions on how well the approximation holds for complex preferences.

The main thing to know is that this paper turns static reward models into something you can steer at test time by feeding in a few preference examples. It frames the problem as amortized variational inference over a latent preference probability, using a conjugate Beta prior on top of the Bradley-Terry model to enable in-context adaptation for both single and multi-objective cases.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Variational In-Context Reward Modeling (ICRM), a Bayesian method for reward modeling that enables test-time steerability of reward models via in-context preference demonstrations. It casts the problem as amortized variational inference over a latent preference probability under the Bradley-Terry model with a conjugate Beta prior. The paper claims that ICRM adapts to unseen preference distributions in both single- and multi-objective settings, reports an RM-Bench accuracy increase from 60.5 to 70.8 with additional demonstrations, lower calibration error than generative judges on moral dilemmas, expansion of the attainable Pareto frontier under conflicting preferences, improved performance over conventional RMs in math reasoning for RL training, and theoretical guarantees that the variational objective admits a global interior optimum with finite confidence while KL regularization mitigates over-optimization.

Significance. If the empirical gains and theoretical guarantees hold under full verification, the work would be significant for RLHF and multi-objective alignment. Enabling test-time adaptation without retraining addresses a core limitation of static reward models, and the combination of conjugate Beta priors with amortized inference offers a principled route to steerability. The reported RM-Bench gains, Pareto expansion, and outperformance on verifiable math rewards indicate practical relevance, while the theoretical analysis of the variational objective and KL effects provides a foundation that could influence future steerable model designs.

major comments (2)

§4 (Theoretical analysis): The guarantee of a global interior optimum with finite confidence is load-bearing for the central theoretical claim; the manuscript must provide the key proof steps or assumptions on the variational family and prior to allow verification that the result does not reduce to a trivial consequence of the conjugate Beta-Bradley-Terry construction.
Experimental results (RM-Bench and Pareto sections): The accuracy lift from 60.5 to 70.8 and Pareto frontier expansion are central to the adaptation claim; the paper should report the exact number of in-context demonstrations, statistical significance across seeds, and an ablation isolating the effect of the Beta prior versus standard in-context prompting to rule out that gains arise from generic few-shot effects.

minor comments (2)

Abstract and §2: The acronym ICRM is expanded on first use, but the precise definition of the latent preference probability variable should be introduced with consistent notation before the variational objective is stated to prevent ambiguity with Bradley-Terry probabilities.
Figure captions and §5: Calibration error plots would benefit from explicit error bars or confidence intervals to support the claim of lower error than the generative judge.
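To make the calibration comparison concrete, here is one common way to compute a binned calibration error for binary preference predictions. This is a sketch under stated assumptions: the review does not say which calibration metric the paper uses, so the equal-width binning, the function name `expected_calibration_error`, and the toy inputs are all illustrative.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary predictions.

    probs:  predicted probabilities that the first response is preferred.
    labels: 1 if the first response was actually preferred, else 0.
    Returns the sample-weighted mean gap between the average predicted
    probability and the observed preference frequency in each bin.
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_freq - avg_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical predictions and outcomes.
    probs = [0.85, 0.85, 0.85, 0.85, 0.15]
    labels = [1, 1, 1, 0, 0]
    print(f"ECE = {expected_calibration_error(probs, labels):.3f}")
```

Reporting this quantity per seed, with its spread, is one way to supply the error bars requested above.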

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our theoretical results and experimental claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: §4 (Theoretical analysis): The guarantee of a global interior optimum with finite confidence is load-bearing for the central theoretical claim; the manuscript must provide the key proof steps or assumptions on the variational family and prior to allow verification that the result does not reduce to a trivial consequence of the conjugate Beta-Bradley-Terry construction.

Authors: We agree that the current presentation of the theoretical guarantee in §4 requires expansion for full verifiability. The result is not a direct trivial consequence of conjugacy; it additionally relies on the structure of the amortized variational family (a mean-field approximation over the Beta distribution parameters) and the specific form of the evidence lower bound under the Bradley-Terry likelihood. In the revised manuscript we will add an appendix with the key proof steps, including the derivation of the interior optimum condition and explicit statements of the assumptions on the variational family and prior. This will allow independent verification while preserving the original claim.

Revision: yes

Referee: Experimental results (RM-Bench and Pareto sections): The accuracy lift from 60.5 to 70.8 and Pareto frontier expansion are central to the adaptation claim; the paper should report the exact number of in-context demonstrations, statistical significance across seeds, and an ablation isolating the effect of the Beta prior versus standard in-context prompting to rule out that gains arise from generic few-shot effects.

Authors: We accept this recommendation. The reported RM-Bench improvement of 60.5 to 70.8 was obtained with 8 in-context demonstrations; we will state this number explicitly and add standard errors plus statistical significance results across three random seeds. We will also include a new ablation that compares ICRM against a non-Bayesian in-context prompting baseline (identical architecture and demonstrations but without the variational Beta prior and amortized inference). This ablation will be reported in the RM-Bench and Pareto sections to isolate the contribution of the conjugate prior formulation.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained on standard models

The paper casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. These are established external components, not redefined by the paper's own equations. Reported test-time adaptation, RM-Bench gains from 60.5 to 70.8, calibration improvements, Pareto expansion, and the global interior optimum guarantee are presented as empirical and theoretical outcomes of the new variational objective rather than reductions to fitted inputs or self-citation chains. No load-bearing step in the abstract or described claims reduces by construction to the inputs; the method remains falsifiable against external benchmarks and does not rename known results or smuggle ansatzes via self-citation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Review based solely on abstract; full details on parameters and assumptions unavailable. The approach relies on standard preference modeling assumptions and introduces a latent variable updated via variational methods.

free parameters (1)

Beta prior hyperparameters
Conjugate Beta prior parameters for the latent preference probability; likely chosen or tuned to enable efficient updates.

axioms (1)

Domain assumption: the Bradley-Terry model governs pairwise preferences
Standard assumption in reward modeling for converting preferences into probabilities.
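As a reminder of what this assumption means in practice, the Bradley-Terry model turns a pair of scalar rewards into a preference probability through a logistic function of their difference. The snippet below is a generic sketch of that standard formula; the function name and the example rewards are illustrative and not drawn from the paper.

```python
import math


def bt_preference_prob(reward_w, reward_l):
    """P(y_w is preferred over y_l) under Bradley-Terry with scalar rewards.

    Equivalent to exp(r_w) / (exp(r_w) + exp(r_l)), written in a form that
    avoids overflow for large reward gaps.
    """
    diff = reward_w - reward_l
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Equal rewards give an even split; a larger gap gives a firmer preference.
    print(bt_preference_prob(0.0, 0.0))
    print(bt_preference_prob(2.0, 0.0))
```

The two orderings of any pair always sum to one, so only the reward difference matters, not the absolute scale.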

invented entities (1)

Latent preference probability (no independent evidence)
Purpose: to represent the underlying distribution that can be steered at test time
Core modeling choice that enables the Bayesian update with in-context data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior... $\mathcal{L}_{\mathrm{ICRM}}(\mu, \tau; \alpha_0, \beta_0) = -\big(\psi(\mu\tau) - \psi(\tau)\big) + \lambda(N)\, D_{\mathrm{KL}}\big(\mathrm{Beta}(\mu\tau, (1-\mu)\tau) \,\|\, \mathrm{Beta}(\alpha_0, \beta_0)\big)$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Theorem 8.2 ... every global minimizer $(\mu^\star, \tau^\star)$ of $\mathcal{L}_{\mathrm{ICRM}}$ satisfies $0 < \mu^\star < 1$ and $0 < \tau^\star < \infty$
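The objective quoted in the passages above can be evaluated numerically to see what an interior optimum looks like. The sketch below is a minimal, standard-library-only illustration, assuming λ(N) is passed in as a plain constant `lam` (its actual functional form is not given here). The helper names, the hand-rolled `digamma` approximation, and the grid ranges are assumptions of this example rather than the authors' implementation.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 10.0:              # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kl_beta(a, b, a0, b0):
    """Closed-form KL( Beta(a, b) || Beta(a0, b0) )."""
    return (log_beta_fn(a0, b0) - log_beta_fn(a, b)
            + (a - a0) * digamma(a)
            + (b - b0) * digamma(b)
            + (a0 - a + b0 - b) * digamma(a + b))


def icrm_loss(mu, tau, alpha0, beta0, lam):
    """-(psi(mu*tau) - psi(tau)) + lam * KL(Beta(mu*tau, (1-mu)*tau) || prior)."""
    a, b = mu * tau, (1.0 - mu) * tau
    expected_nll = -(digamma(a) - digamma(tau))   # E_q[-log p] for p ~ Beta(a, b)
    return expected_nll + lam * kl_beta(a, b, alpha0, beta0)


if __name__ == "__main__":
    # Coarse grid search with a uniform Beta(1, 1) prior and lam = 1 (toy values).
    mus = [i / 100 for i in range(1, 100)]
    taus = [0.1 * j for j in range(1, 501)]
    loss, mu, tau = min(
        (icrm_loss(m, t, 1.0, 1.0, 1.0), m, t) for m in mus for t in taus
    )
    print(f"grid minimum: loss={loss:.4f} at mu={mu:.2f}, tau={tau:.1f}")
```

For this toy setting the minimizer can also be worked out by hand: minimizing an expected loss plus λ times a KL term to a prior gives a posterior proportional to the prior times $p^{1/\lambda}$, which here is Beta(2, 1), i.e. µ = 2/3 and τ = 3 with loss ln 2. That is an interior point with finite τ, in line with the statement of Theorem 8.2, though it is only a single example and not a substitute for the proof.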

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

