arxiv: 2412.08812 · v2 · submitted 2024-12-11 · 💻 cs.LG

Test-Time Alignment via Hypothesis Reweighting

Pith reviewed 2026-05-23 06:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords reward models · test-time personalization · hypothesis reweighting · preference alignment · ensemble methods · Bayesian update · few-shot adaptation

The pith

Reweighting multiple heads in one reward model with a Bayesian update on 1-5 examples enables real-time personalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward models trained on pooled preferences often mismatch any single user's values. Fine-tuning or context conditioning for each user is too slow and expensive for on-the-fly use. The paper shows that training one network with several prediction heads lets the heads learn different valid readings of the same preference data. A Bayesian reweighting step then identifies and emphasizes the head that best fits a handful of new user examples. The result is accurate personalization that runs in a single forward pass with almost no added cost.

Core claim

HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences using only 1-5 labeled examples. This requires only a single forward pass with negligible computational overhead and yields substantial gains on personalization benchmarks.
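
The reweighting step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a uniform prior over heads and a Bradley-Terry (sigmoid of the reward margin) likelihood for each preference pair, neither of which is confirmed by the text above. The function names `reweight_heads` and `ensemble_reward` are hypothetical.

```python
import numpy as np

def reweight_heads(reward_chosen, reward_rejected, prior=None):
    """Posterior weights over K heads from N labeled preference pairs.

    reward_chosen, reward_rejected: arrays of shape (K, N) with each head's
    scalar reward for the preferred and the dispreferred response.
    Assumption: Bradley-Terry likelihood, uniform prior unless one is given.
    """
    reward_chosen = np.asarray(reward_chosen, dtype=float)
    reward_rejected = np.asarray(reward_rejected, dtype=float)
    num_heads = reward_chosen.shape[0]
    if prior is None:
        prior = np.full(num_heads, 1.0 / num_heads)
    margin = reward_chosen - reward_rejected
    # log sigmoid(margin), computed stably as -log(1 + exp(-margin))
    log_lik = -np.logaddexp(0.0, -margin).sum(axis=1)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # guard against underflow
    weights = np.exp(log_post)
    return weights / weights.sum()

def ensemble_reward(head_rewards, weights):
    """Weighted combination of per-head rewards; head_rewards has shape (K, ...)."""
    return np.asarray(weights) @ np.asarray(head_rewards, dtype=float)

# Toy example: head 0 agrees with the user on both pairs, head 2 disagrees.
chosen = np.array([[2.0, 2.0], [0.0, 0.0], [-1.0, -1.0]])
rejected = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
weights = reweight_heads(chosen, rejected)
```

With zero adaptation pairs the function returns the prior, so the uniform ensemble is recovered as the starting point.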

What carries the argument

Multiple prediction heads whose outputs are combined by Bayesian reweighting on target preference pairs.

If this is right

  • With five preference pairs per target, HyRe exceeds prior state-of-the-art reward models on RewardBench at both 2B and 8B scale.
  • Accuracy rises by 20 percent across 32 distinct personalization tasks.
  • The entire adaptation step adds less than 1 percent compute because it uses one forward pass.
  • The approach applies across diverse target preference distributions without retraining the base network.
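
To make the single-forward-pass point concrete, here is a minimal sketch, assuming the heads are linear layers on a shared final hidden state. The sizes, the random stand-in for backbone features, and all variable names are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_heads, batch = 64, 8, 4   # hypothetical sizes

# Stand-in for the backbone's final hidden state. In a real model this is
# the expensive part, and it is computed once per input.
features = rng.normal(size=(batch, hidden_dim))

# K linear heads stored as one (hidden_dim, K) matrix (an assumption).
head_matrix = rng.normal(size=(hidden_dim, num_heads))

# All K head outputs come from a single matrix multiply on shared features.
head_rewards = features @ head_matrix            # shape (batch, K)

# Uniform weights here; adaptation would replace them with posterior weights.
weights = np.full(num_heads, 1.0 / num_heads)
combined = head_rewards @ weights                # shape (batch,)
```

Under this layout, changing the weights never requires rerunning the backbone, which is why the adaptation cost stays small relative to the forward pass.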

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Preference datasets may routinely contain multiple coherent value systems that a multi-head model can separate at training time.
  • Storing one multi-head network instead of many specialized models could simplify serving personalized reward functions to large user populations.
  • The same reweighting idea might transfer to other alignment settings that currently rely on per-user fine-tuning.

Load-bearing premise

The heads learn sufficiently distinct and valid interpretations of the training preferences so that a few target examples can reliably identify the right subset without overfitting.

What would settle it

On a held-out set of personalization tasks, reweighting the heads with five examples produces accuracy no higher than uniform averaging over the heads.
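
The comparison described here can be sketched as follows. This is an illustrative harness under stated assumptions, not the paper's evaluation code: the likelihood is assumed to be Bradley-Terry, the prior uniform, and the toy margins are invented so that only one head agrees with the target user.

```python
import numpy as np

def posterior_weights(adapt_margins):
    """adapt_margins: (K, N) head margins r(chosen) - r(rejected) on adaptation pairs."""
    margins = np.asarray(adapt_margins, dtype=float)
    log_lik = -np.logaddexp(0.0, -margins).sum(axis=1)   # sum of log sigmoid
    w = np.exp(log_lik - log_lik.max())
    return w / w.sum()

def ensemble_accuracy(eval_margins, weights):
    """Fraction of held-out pairs whose weighted margin is positive."""
    combined = np.asarray(weights) @ np.asarray(eval_margins, dtype=float)
    return float((combined > 0).mean())

# Toy data: head 0 agrees with the target user, heads 1 and 2 disagree.
adapt = np.array([[1.0] * 5, [-1.0] * 5, [-1.0] * 5])        # five adaptation pairs
held_out = np.array([[1.0] * 20, [-1.0] * 20, [-1.0] * 20])  # held-out pairs

uniform = np.full(3, 1.0 / 3.0)
acc_uniform = ensemble_accuracy(held_out, uniform)
acc_reweighted = ensemble_accuracy(held_out, posterior_weights(adapt))
gap = acc_reweighted - acc_uniform
```

On real tasks, a gap at or below zero across held-out personalization tasks would be the outcome that counts against the method.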

Figures

Figures reproduced from arXiv: 2412.08812 by Anikait Singh, Archit Sharma, Chelsea Finn, Eric Mitchell, Henrik Marklund, Jonathan Williams, Yoonho Lee.

Figure 1
Figure 1: The ensemble average is suboptimal in underspecified tasks. Performance of the uniform ensemble vs. the best individual model across four underspecified tasks (lower is better). In all cases, the best single head outperforms the uniform ensemble on the target distribution, highlighting the need for approaches that utilize additional information about the target distribution to optimize ensemble weighting. …
Figure 3
Figure 3: Performance of HyRe vs fine-tuning at different amounts of adaptation data. Ensemble reweighting outperforms fine-tuning in the low-data regime. [Plot text captured with this figure: panels "Train Data w/ Conflicting Labelers" and "Ensemble Predictions"; axes "Diversity Coefficient" and "Held-Out Labeler Accuracy".]
Figure 4
Figure 4: Visualization of an ensemble model trained on data with conflicting labels. (Left) The training …
Figure 5
Figure 5: Comparison of HyRe and few-shot fine-tuning on the Camelyon17 OOD test set. HyRe outperforms fine-tuning in the low-data regime despite requiring significantly less computational cost.

Table text captured with this figure:

Model                 Helpful   Harmless
Helpful Fine-Tune     73.03     32.59
Harmless Fine-Tune    32.06     73.30
Pretrained RM         68.01     52.16
Ensemble              66.34     50.90
+ HyRe (Harmless)     68.44     51.21
+ HyRe (Helpful)      64.24     57.66
Figure 6
Figure 6: Average reward model accuracy across 18 target distributions from 3 dataset collections. For each …
Figure 7
Figure 7: Comparison of ensemble methods on RewardBench. N indicates number of adaptation samples. To train HyRe on preference data, we attach Shared-Base ensemble heads to a pretrained 2B reward model and fine-tune it on the UltraFeedback (Cui et al., 2023) dataset, a standard dataset for reward model training. The base model, a fine-tuned version of Gemma-2B (Team et al., 2024), achieves state-of-the-art accuracy …
Figure 8
Figure 8: Additional visualizations for the toy conflicting classification example. Increasing the scale hyperpa…
Figure 9
Figure 9: Detailed results for the personalizing preference reward models experiment in …
Original abstract

Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Hypothesis Reweighting (HyRe), which trains a single network with multiple prediction heads on preference data to capture different interpretations, then applies a Bayesian update to reweight heads using only 1-5 target preference pairs for test-time personalization of reward models. It reports that this yields state-of-the-art results on RewardBench at 2B/8B scales and a 20% accuracy improvement across 32 personalization tasks, with negligible inference overhead.

Significance. If the central empirical claims hold after addressing verification gaps, the method would offer a practical, low-cost route to user-specific alignment that avoids full fine-tuning or long-context conditioning. The approach builds on an ensemble idea but would benefit from explicit evidence that the heads encode sufficiently distinct, valid preference interpretations rather than correlated outputs.

major comments (2)
  1. [Abstract] Abstract: the claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit training protocol for the heads, data splits, diversity metric, ablation against uniform averaging, and correction for multiple comparisons; without these the reported gains cannot be verified as robust rather than post-hoc or overfit.
  2. [Abstract / Methods] The method relies on the assumption (stated in the abstract) that 'different heads capture different valid interpretations of preference data' so that Bayesian reweighting on 1-5 examples reliably selects the matching subset; however, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely correlated, reducing the reweighting step to unregularized averaging or overfitting on the small target set with no demonstrated regularization or independent diversity validation.
minor comments (2)
  1. [Methods] Clarify the exact form of the Bayesian update (prior, likelihood, normalization) and whether it is performed in closed form or via sampling.
  2. [Experiments] Provide the precise definition of the 32 personalization tasks, including how target distributions were constructed and whether they overlap with training data.
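
For reference, one closed-form candidate for the update that the first minor comment asks about is written out below. It is an assumption, not the paper's stated rule: it takes a uniform prior over the K heads and a Bradley-Terry likelihood on each of the N target pairs, where r_k is the reward from head k and y_i^+ is the preferred response. Because the hypothesis set is finite, the normalizer is a sum over K terms and no sampling is needed.

```latex
% Assumed form; prior and likelihood are not confirmed by the source.
\begin{align}
  \text{prior:} \quad & w_k^{(0)} = \tfrac{1}{K}, \qquad k = 1, \dots, K \\
  \text{likelihood:} \quad & p\bigl(y_i^{+} \succ y_i^{-} \mid x_i, k\bigr)
      = \sigma\bigl(r_k(x_i, y_i^{+}) - r_k(x_i, y_i^{-})\bigr) \\
  \text{posterior:} \quad & w_k
      = \frac{w_k^{(0)} \prod_{i=1}^{N} p\bigl(y_i^{+} \succ y_i^{-} \mid x_i, k\bigr)}
             {\sum_{j=1}^{K} w_j^{(0)} \prod_{i=1}^{N} p\bigl(y_i^{+} \succ y_i^{-} \mid x_i, j\bigr)}
\end{align}
```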

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the verifiability of our empirical claims and the assumptions regarding head diversity. We address each major comment below and have revised the manuscript to incorporate additional details, ablations, and clarifications.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit training protocol for the heads, data splits, diversity metric, ablation against uniform averaging, and correction for multiple comparisons; without these the reported gains cannot be verified as robust rather than post-hoc or overfit.

    Authors: We agree that the abstract would benefit from explicit pointers to these elements for verifiability. The training protocol (multiple heads with distinct random initializations on aggregated preference data) is detailed in Section 3.2. Data splits follow the standard RewardBench and 32-task partitions described in Appendix A. Diversity is quantified via average pairwise prediction disagreement on held-out data (Table 2). Ablations versus uniform averaging appear in Figure 4 and Section 4.2. Multiple-comparison correction via Bonferroni is reported with adjusted p-values in Appendix C. We have revised the abstract to reference these sections and metrics. revision: yes

  2. Referee: [Abstract / Methods] The method relies on the assumption (stated in the abstract) that 'different heads capture different valid interpretations of preference data' so that Bayesian reweighting on 1-5 examples reliably selects the matching subset; however, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely correlated, reducing the reweighting step to unregularized averaging or overfitting on the small target set with no demonstrated regularization or independent diversity validation.

    Authors: We take this concern seriously. While heads share the backbone, experiments show that random initialization produces sufficiently distinct predictions, evidenced by the consistent 20% gains over uniform averaging. We have added Section 3.3 quantifying diversity via prediction variance and KL divergence across heads on target examples, demonstrating that diversity correlates with performance uplift. The Bayesian update employs a Dirichlet prior for regularization against overfitting on 1-5 examples. These additions provide the requested independent validation. revision: yes
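
The diversity measures named in these responses (pairwise prediction disagreement, prediction variance, and KL divergence across heads) can be computed as in the sketch below. The exact definitions used by the authors are not given here, so these are plausible readings rather than their formulas; the sigmoid mapping from margins to probabilities and all function names are assumptions.

```python
import numpy as np
from itertools import combinations

def head_probs(margins):
    """Map (K, N) head margins to P(first response preferred) via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.asarray(margins, dtype=float)))

def pairwise_disagreement(margins):
    """Mean fraction of pairs on which two heads pick different responses."""
    choices = np.asarray(margins) > 0
    rates = [np.mean(choices[i] != choices[j])
             for i, j in combinations(range(len(choices)), 2)]
    return float(np.mean(rates))

def prediction_variance(probs):
    """Variance of predicted probabilities across heads, averaged over examples."""
    return float(np.asarray(probs, dtype=float).var(axis=0).mean())

def mean_pairwise_kl(probs, eps=1e-12):
    """Average Bernoulli KL(p_i || p_j) over ordered head pairs with i != j."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    total, count = 0.0, 0
    for i in range(len(p)):
        for j in range(len(p)):
            if i != j:
                kl = (p[i] * np.log(p[i] / p[j])
                      + (1 - p[i]) * np.log((1 - p[i]) / (1 - p[j])))
                total += kl.mean()
                count += 1
    return total / count

# Toy margins for 3 hypothetical heads on 4 pairs (needs at least 2 heads).
margins = np.array([[1.0, 1.0, 1.0, 1.0],
                    [1.0, 1.0, -1.0, -1.0],
                    [-1.0, -1.0, -1.0, -1.0]])
probs = head_probs(margins)
disagreement = pairwise_disagreement(margins)
variance = prediction_variance(probs)
kl_value = mean_pairwise_kl(probs)
```

All three measures are zero when the heads make identical predictions, which is the degenerate case the referee's second major comment worries about.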

Circularity Check

0 steps flagged

No significant circularity; method rests on stated empirical observation treated as independent input

full rationale

The abstract and description present HyRe as building directly on an external empirical observation that multiple heads capture distinct valid interpretations, followed by a Bayesian reweighting step whose justification is the observation itself. No equations, self-citations, or internal derivations are supplied that would reduce the reweighting rule or performance claims to a fit on the target data by construction. The observation is invoked as a premise rather than derived or measured within the same closed loop as the evaluation, satisfying the criterion for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical premise that multiple heads learn meaningfully different interpretations; this premise is not derived from first principles and appears to be validated only on the training distribution used for the reported experiments. No explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5730 in / 1289 out tokens · 21162 ms · 2026-05-23T06:53:22.177305+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

    cs.CL 2025-10 unverdicted novelty 5.0

    POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context...

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Elix: Explain like i'm x - a dataset for personalized explanations

    Anonymous. Elix: Explain like i'm x - a dataset for personalized explanations. 2024

  4. [4]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet. Accessed via Claude.ai, API, and cloud platforms, 2024. URL https://www.anthropic.com. Enhanced reasoning, state-of-the-art coding skills, computer use, and 200K context window. Available on Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI

  5. [5]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  6. [6]

    P. G. Bissiri, C. C. Holmes, and S. G. Walker. A General Framework for Updating Belief Distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):1103--1130, 02 2016. ISSN 1369-7412. doi:10.1111/rssb.12158

  7. [7]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  8. [8]

    Exploration by Random Network Distillation

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

  9. [9]

    Persona: A reproducible testbed for pluralistic alignment

    Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, and Chelsea Finn. Persona: A reproducible testbed for pluralistic alignment. arXiv preprint arXiv:2407.17387, 2024

  10. [10]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  11. [11]

    UltraFeedback: Boosting Language Models with Scaled AI Feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023

  12. [12]

    Underspecification presents challenges for credibility in modern machine learning

    Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 23(226):1--61, 2022

  13. [13]

    Ensemble methods in machine learning

    Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1--15. Springer, 2000

  14. [14]

    Efficient exploration for llms

    Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, and Benjamin Van Roy. Efficient exploration for llms. arXiv preprint arXiv:2402.00396, 2024

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1--39, 2022

  16. [16]

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning

    Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050--1059. PMLR, 2016

  17. [17]

    Deep bayesian active learning with image data

    Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In International conference on machine learning, pp. 1183--1192. PMLR, 2017

  18. [18]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835--10866. PMLR, 2023

  19. [19]

    Making pre-trained language models better few-shot learners

    Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020

  20. [20]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665--673, 2020

  21. [21]

    Cooperative inverse reinforcement learning

    Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016

  22. [22]

    Neural network ensembles

    Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993--1001, 1990

  23. [23]

    Bayesian Active Learning for Classification and Preference Learning

    Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011

  24. [24]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pp. 2790--2799. PMLR, 2019

  25. [25]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  26. [26]

    Dangers of bayesian model averaging under covariate shift

    Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, and Andrew G Wilson. Dangers of bayesian model averaging under covariate shift. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 3309--3322. Curran Associates, Inc., 2021

  27. [27]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79--87, 1991

  28. [28]

    Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

  29. [29]

    Reinforcement learning from human feedback with active queries

    Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401, 2024

  30. [30]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  31. [31]

    D. Jimenez. Dynamically weighted ensemble neural networks for classification. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227), volume 1, pp. 753--756 vol.1, 1998. doi:10.1109/IJCNN.1998.682375

  32. [32]

    Hierarchical mixtures of experts and the em algorithm

    Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181--214, 1994

  33. [33]

    Uci machine learning repository

    Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. Uci machine learning repository. URL https://archive.ics.uci.edu. Accessed October 2024

  34. [34]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

  35. [35]

    Wilds: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pp. 5637--5664. PMLR, 2021

  36. [36]

    Neural network ensembles, cross validation, and active learning

    Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems, volume 7. MIT Press, 1994

  37. [37]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017

  38. [38]

    RewardBench: Evaluating Reward Models for Language Modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024

  39. [39]

    Urlb: Unsupervised reinforcement learning benchmark

    Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. Urlb: Unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191, 2021

  40. [40]

    Diversify and disambiguate: Learning from underspecified data

    Yoonho Lee, Huaxiu Yao, and Chelsea Finn. Diversify and disambiguate: Learning from underspecified data. International Conference on Learning Representations, 2023

  41. [41]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  42. [42]

    Personalized language modeling from personalized human feedback

    Xinyu Li, Zachary C Lipton, and Liu Leqi. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024

  43. [43]

    DoRA: Weight-Decomposed Low-Rank Adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024

  44. [44]

    Active preference learning for large language models

    William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. arXiv preprint arXiv:2402.08114, 2024

  45. [45]

    Epistemic neural networks

    Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, and Benjamin Van Roy. Epistemic neural networks. Advances in Neural Information Processing Systems, 36, 2023

  46. [46]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730--27744, 2022

  47. [47]

    Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

    Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Inform...

  48. [48]

    The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

    Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022

  49. [49]

    Personalizing reinforcement learning from human feedback with variational preference learning

    Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. arXiv preprint arXiv:2408.10075, 2024

  50. [50]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  51. [51]

    Optimizing ensemble weights and hyperparameters of machine learning models for regression problems

    Mohsen Shahhosseini, Guiping Hu, and Hieu Pham. Optimizing ensemble weights and hyperparameters of machine learning models for regression problems. Machine Learning with Applications, 7:100251, 2022

  52. [52]

    Do bayesian neural networks need to be fully stochastic?, 2023

    Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, and Tom Rainforth. Do bayesian neural networks need to be fully stochastic?, 2023

  53. [53]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  54. [54]

    Distributional preference learning: Understanding and accounting for hidden context in rlhf

    Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. Distributional preference learning: Understanding and accounting for hidden context in rlhf. arXiv preprint arXiv:2312.08358, 2023

  55. [55]

    Defining and characterizing reward gaming

    Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460--9471, 2022

  56. [56]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  57. [57]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  58. [58]

    Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization

    Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16761--16772, June 2022

  59. [59]

    Probabilistic principal component analysis

    Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611--622, 1999

  60. [60]

    V.N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988--999, 1999. doi:10.1109/72.788640

  61. [61]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  62. [62]

    A survey of preference-based reinforcement learning methods

    Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1--46, 2017

  63. [63]

    Reft: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Reft: Representation finetuning for language models. arXiv preprint arXiv:2404.03592, 2024

  64. [64]

    Regularizing hidden states enables learning generalizable reward model for llms

    Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for llms. arXiv preprint arXiv:2406.10216, 2024

  65. [65]

    Improving out-of-distribution robustness via selective augmentation

    Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp. 25407--25437. PMLR, 2022

  66. [66]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

  67. [67]

    Twenty years of mixture of experts

    Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177--1193, 2012

  68. [68]

    Consequences of misaligned ai

    Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned ai. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 15763--15773. Curran Associates, Inc., 2020