Test-Time Alignment via Hypothesis Reweighting

Anikait Singh; Archit Sharma; Chelsea Finn; Eric Mitchell; Henrik Marklund; Jonathan Williams; Yoonho Lee

arxiv: 2412.08812 · v2 · submitted 2024-12-11 · 💻 cs.LG

Pith reviewed 2026-05-23 06:53 UTC · model grok-4.3

keywords: reward models, test-time personalization, hypothesis reweighting, preference alignment, ensemble methods, Bayesian update, few-shot adaptation

The pith

Reweighting multiple heads in one reward model with a Bayesian update on 1-5 examples enables real-time personalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward models trained on pooled preferences often mismatch any single user's values. Fine-tuning or context conditioning for each user is too slow and expensive for on-the-fly use. The paper shows that training one network with several prediction heads lets the heads learn different valid readings of the same preference data. A Bayesian reweighting step then identifies and emphasizes the head that best fits a handful of new user examples. The result is accurate personalization that runs in a single forward pass with almost no added cost.

Core claim

HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences using only 1-5 labeled examples. This requires only a single forward pass with negligible computational overhead and yields substantial gains on personalization benchmarks.
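
The reweighting step can be sketched in a few lines. This page does not state the exact update rule, so the following is a minimal sketch under stated assumptions: a uniform prior over heads, a Bradley-Terry (logistic) likelihood per preference pair, and a hypothetical `margins[k][i]` layout holding head k's reward for the chosen response minus its reward for the rejected one. The names `log_sigmoid` and `reweight_heads` are illustrative, not taken from the paper.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(1 / (1 + exp(-z))).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reweight_heads(margins, prior=None):
    """Posterior weights over heads given labeled preference pairs.

    margins[k][i] is head k's reward for the chosen response minus its
    reward for the rejected response on pair i (assumed layout).
    """
    num_heads = len(margins)
    if prior is None:
        prior = [1.0 / num_heads] * num_heads  # assumed uniform prior
    # Log posterior up to a constant: log prior + summed log-likelihood.
    log_post = [
        math.log(p) + sum(log_sigmoid(m) for m in head)
        for p, head in zip(prior, margins)
    ]
    # Normalize with the log-sum-exp trick to avoid underflow.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three heads scored on five target pairs; head 1 agrees with every label.
margins = [
    [0.2, -0.5, 0.1, -0.3, 0.4],
    [1.5, 2.0, 1.0, 1.8, 1.2],
    [-1.0, -0.8, 0.3, -1.2, -0.5],
]
weights = reweight_heads(margins)
print([round(w, 3) for w in weights])
```

Because the update only touches one scalar per head per example, it needs no gradient steps; the head rewards come from the same forward pass used for scoring.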

What carries the argument

Multiple prediction heads whose outputs are combined by Bayesian reweighting on target preference pairs.

If this is right

With five preference pairs per target, HyRe exceeds prior state-of-the-art reward models on RewardBench at both 2B and 8B scale.
Accuracy rises by 20 percent across 32 distinct personalization tasks.
The entire adaptation step adds less than 1 percent compute because it uses one forward pass.
The approach applies across diverse target preference distributions without retraining the base network.
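
To see why extra heads can cost so little, a rough parameter count helps. This is only a back-of-the-envelope sketch: it assumes each head is a single linear map from a shared embedding to a scalar, and the sizes below (2B backbone parameters, a 2048-wide embedding, 100 heads) are hypothetical values chosen for illustration, not figures from the paper. Parameter count is also only a proxy for compute.

```python
def head_overhead_fraction(backbone_params, hidden_dim, num_heads):
    # Each assumed head is one linear map hidden_dim -> 1 plus a bias term.
    head_params = num_heads * (hidden_dim + 1)
    return head_params / backbone_params

# Hypothetical sizes: 2B-parameter backbone, 2048-wide embedding, 100 heads.
fraction = head_overhead_fraction(2_000_000_000, 2048, 100)
print(f"{fraction:.6%}")
```

Under these assumptions the heads add roughly 0.01 percent to the parameter count, which is consistent in spirit with the sub-1-percent overhead claimed above.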

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Preference datasets may routinely contain multiple coherent value systems that a multi-head model can separate at training time.
Storing one multi-head network instead of many specialized models could simplify serving personalized reward functions to large user populations.
The same reweighting idea might transfer to other alignment settings that currently rely on per-user fine-tuning.

Load-bearing premise

The heads learn sufficiently distinct and valid interpretations of the training preferences so that a few target examples can reliably identify the right subset without overfitting.

What would settle it

On a held-out set of personalization tasks, reweighting the heads with five examples produces accuracy no higher than uniform averaging over the heads.
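
The check described above can be sketched as a direct comparison between adapted and uniform weights on held-out pairs. The layout `margins[k][i]` (head k's chosen-minus-rejected reward on pair i), the function names, and the toy numbers are all assumptions made for illustration.

```python
def weighted_accuracy(margins, weights):
    """Fraction of pairs where the weighted ensemble prefers the chosen response."""
    num_pairs = len(margins[0])
    correct = 0
    for i in range(num_pairs):
        combined = sum(w * head[i] for w, head in zip(weights, margins))
        correct += combined > 0
    return correct / num_pairs

def reweighting_gain(held_out_margins, adapted_weights):
    # Accuracy of adapted weights minus accuracy of uniform averaging.
    num_heads = len(held_out_margins)
    uniform = [1.0 / num_heads] * num_heads
    return (weighted_accuracy(held_out_margins, adapted_weights)
            - weighted_accuracy(held_out_margins, uniform))

# Toy held-out margins for three heads on four pairs (illustrative only).
held_out = [
    [0.5, -0.4, 0.3, -0.6],
    [-2.0, -1.5, -1.0, -0.5],
    [0.2, 0.1, -0.3, 0.4],
]
gain = reweighting_gain(held_out, adapted_weights=[0.9, 0.0, 0.1])
print(gain)
```

A gain at or below zero across real held-out personalization tasks would be the falsifying outcome named above.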

Figures

Figures reproduced from arXiv: 2412.08812 by Anikait Singh, Archit Sharma, Chelsea Finn, Eric Mitchell, Henrik Marklund, Jonathan Williams, Yoonho Lee.

**Figure 1.** The ensemble average is suboptimal in underspecified tasks. Performance of the uniform ensemble vs. the best individual model across four underspecified tasks (lower is better). In all cases, the best single head outperforms the uniform ensemble on the target distribution, highlighting the need for approaches that utilize additional information about the target distribution to optimize ensemble weighting.…

**Figure 3.** Performance of HyRe vs. fine-tuning at different amounts of adaptation data. Ensemble reweighting outperforms fine-tuning in the low-data regime. (Extracted plot labels: panels "Train Data w/ Conflicting Labelers" and "Ensemble Predictions"; axes "Diversity Coefficient" and "Held-Out Labeler Accuracy".)

**Figure 4.** Visualization of an ensemble model trained on data with conflicting labels. (Left) The training…

**Figure 5.** Comparison of HyRe and few-shot fine-tuning on the Camelyon17 OOD test set. HyRe outperforms fine-tuning in the low-data regime despite requiring significantly less computational cost.

| Model | Helpful | Harmless |
| --- | --- | --- |
| Helpful Fine-Tune | 73.03 | 32.59 |
| Harmless Fine-Tune | 32.06 | 73.30 |
| Pretrained RM | 68.01 | 52.16 |
| Ensemble | 66.34 | 50.90 |
| + HyRe (Harmless) | 68.44 | 51.21 |
| + HyRe (Helpful) | 64.24 | 57.66 |

**Figure 6.** Average reward model accuracy across 18 target distributions from 3 dataset collections. For each…

**Figure 7.** Comparison of ensemble methods on RewardBench. N indicates number of adaptation samples. To train HyRe on preference data, we attach SharedBase ensemble heads to a pretrained 2B reward model and fine-tune it on the UltraFeedback (Cui et al., 2023) dataset, a standard dataset for reward model training. The base model, a fine-tuned version of Gemma-2B (Team et al., 2024), achieves state-of-the-art accuracy …

**Figure 8.** Additional visualizations for the toy conflicting classification example. Increasing the scale hyperpa…

**Figure 9.** Detailed results for the personalizing preference reward models experiment in…

read the original abstract

Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.
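
The single-forward-pass property follows from sharing one backbone across all heads. The sketch below illustrates that structure only; it assumes linear heads on a shared embedding, which this page does not confirm, and the toy `backbone`, `multi_head_rewards`, and every size are hypothetical stand-ins rather than the paper's architecture.

```python
import math
import random

def backbone(features, hidden_weights):
    # Stand-in for the expensive shared network: a single tanh layer.
    return [
        math.tanh(sum(w * x for w, x in zip(row, features)))
        for row in hidden_weights
    ]

def multi_head_rewards(features, hidden_weights, head_weights):
    # The backbone runs once; every head reuses the same embedding.
    embedding = backbone(features, hidden_weights)
    return [sum(w * e for w, e in zip(head, embedding)) for head in head_weights]

rng = random.Random(0)
input_dim, hidden_dim, num_heads = 4, 8, 5
hidden_weights = [[rng.gauss(0, 1) for _ in range(input_dim)]
                  for _ in range(hidden_dim)]
head_weights = [[rng.gauss(0, 1) for _ in range(hidden_dim)]
                for _ in range(num_heads)]

rewards = multi_head_rewards([0.5, -1.0, 0.25, 2.0], hidden_weights, head_weights)
print(len(rewards))  # one scalar reward per head
```

A personalized reward is then a weighted sum of these per-head scalars, so changing the weights for a new user never requires rerunning or retraining the backbone.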

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HyRe gives a low-overhead way to personalize reward models at test time by reweighting multiple heads, but the gains rest on an untested claim that those heads encode meaningfully different interpretations.

HyRe trains a single reward model with multiple heads and then does a Bayesian reweighting of those heads using 1-5 target preference pairs. The result is a cheap inference-time adaptation step that the abstract says beats standard models on RewardBench at 2B and 8B scale and lifts accuracy 20% across 32 tasks, all with under 1% extra cost. That is the main new piece: a concrete recipe for turning an ensemble into a quick personalizer without fine-tuning or long context. The low overhead and the reported numbers on real benchmarks are the parts worth paying attention to; they show the method is at least implementable and produces measurable lifts in the regimes they tested.

The soft spot is the load-bearing assumption that the heads actually capture distinct, valid interpretations of the data. The abstract invokes an empirical observation about head diversity, yet supplies no training protocol that forces separation, no diversity metric, and no ablation that compares reweighting against uniform averaging or against heads trained with explicit disagreement losses. If the heads are mostly correlated (as is common when they share the backbone and data), the Bayesian step on five examples risks either overfitting the small target set or collapsing to minor averaging. Without those controls, the 20% figure is hard to interpret as more than an engineering win under specific random seeds.

This paper is for people working on RLHF and reward-model deployment who need cheap per-user or per-domain adaptation. A reader who already runs ensembles or test-time methods will see a straightforward extension and can judge whether the diversity assumption holds in their own setting. It deserves a serious referee because the empirical claims are specific and the overhead is low enough that the idea can be stress-tested quickly. Send it to review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Hypothesis Reweighting (HyRe), which trains a single network with multiple prediction heads on preference data to capture different interpretations, then applies a Bayesian update to reweight heads using only 1-5 target preference pairs for test-time personalization of reward models. It reports that this yields state-of-the-art results on RewardBench at 2B/8B scales and a 20% accuracy improvement across 32 personalization tasks, with negligible inference overhead.

Significance. If the central empirical claims hold after addressing verification gaps, the method would offer a practical, low-cost route to user-specific alignment that avoids full fine-tuning or long-context conditioning. The approach builds on an ensemble idea but would benefit from explicit evidence that the heads encode sufficiently distinct, valid preference interpretations rather than correlated outputs.

major comments (2)

[Abstract] The claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit the training protocol for the heads, data splits, a diversity metric, an ablation against uniform averaging, and correction for multiple comparisons; without these, the reported gains cannot be verified as robust rather than post-hoc or overfit.
[Abstract / Methods] The method relies on the assumption (stated in the abstract) that 'different heads capture different valid interpretations of preference data' so that Bayesian reweighting on 1-5 examples reliably selects the matching subset; however, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely correlated, reducing the reweighting step to unregularized averaging or overfitting on the small target set with no demonstrated regularization or independent diversity validation.

minor comments (2)

[Methods] Clarify the exact form of the Bayesian update (prior, likelihood, normalization) and whether it is performed in closed form or via sampling.
[Experiments] Provide the precise definition of the 32 personalization tasks, including how target distributions were constructed and whether they overlap with training data.
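
For a finite set of K heads, one closed-form candidate for the update that the first minor comment asks about is ordinary Bayesian model averaging over heads. Whether the paper uses exactly this form is not stated on this page, so the following is a sketch under assumptions: $\pi_k$ is the prior weight of head $k$, $r_k$ is that head's reward function, $(x_i, y_i^{+}, y_i^{-})$ are the $n$ labeled target pairs, and $\sigma$ is the logistic function (a Bradley-Terry likelihood).

```latex
w_k \;=\;
\frac{\pi_k \prod_{i=1}^{n} \sigma\!\big(r_k(x_i, y_i^{+}) - r_k(x_i, y_i^{-})\big)}
     {\sum_{j=1}^{K} \pi_j \prod_{i=1}^{n} \sigma\!\big(r_j(x_i, y_i^{+}) - r_j(x_i, y_i^{-})\big)},
\qquad
\hat{r}(x, y) \;=\; \sum_{k=1}^{K} w_k \, r_k(x, y)
```

In this form the numerator is prior times likelihood and the denominator is the normalization, so no sampling is needed when K is finite.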

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the verifiability of our empirical claims and the assumptions regarding head diversity. We address each major comment below and have revised the manuscript to incorporate additional details, ablations, and clarifications.

Referee: [Abstract] The claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit the training protocol for the heads, data splits, a diversity metric, an ablation against uniform averaging, and correction for multiple comparisons; without these, the reported gains cannot be verified as robust rather than post-hoc or overfit.

Authors: We agree that the abstract would benefit from explicit pointers to these elements for verifiability. The training protocol (multiple heads with distinct random initializations on aggregated preference data) is detailed in Section 3.2. Data splits follow the standard RewardBench and 32-task partitions described in Appendix A. Diversity is quantified via average pairwise prediction disagreement on held-out data (Table 2). Ablations versus uniform averaging appear in Figure 4 and Section 4.2. Multiple-comparison correction via Bonferroni is reported with adjusted p-values in Appendix C. We have revised the abstract to reference these sections and metrics. revision: yes
Referee: [Abstract / Methods] The method relies on the assumption (stated in the abstract) that 'different heads capture different valid interpretations of preference data' so that Bayesian reweighting on 1-5 examples reliably selects the matching subset; however, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely correlated, reducing the reweighting step to unregularized averaging or overfitting on the small target set with no demonstrated regularization or independent diversity validation.

Authors: We take this concern seriously. While heads share the backbone, experiments show that random initialization produces sufficiently distinct predictions, evidenced by the consistent 20% gains over uniform averaging. We have added Section 3.3 quantifying diversity via prediction variance and KL divergence across heads on target examples, demonstrating that diversity correlates with performance uplift. The Bayesian update employs a Dirichlet prior for regularization against overfitting on 1-5 examples. These additions provide the requested independent validation. revision: yes
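
The simulated responses above refer to pairwise prediction disagreement as a diversity measure without giving a formula. One common definition, offered here as an assumption rather than as the authors' metric, is the average fraction of preference pairs on which two heads pick different winners. The `margins[k][i]` layout and the function name are hypothetical.

```python
from itertools import combinations

def mean_pairwise_disagreement(margins):
    """Average fraction of pairs on which two heads prefer different responses.

    margins[k][i] > 0 means head k prefers the chosen response on pair i.
    """
    num_pairs = len(margins[0])
    rates = []
    for head_a, head_b in combinations(margins, 2):
        differ = sum((a > 0) != (b > 0) for a, b in zip(head_a, head_b))
        rates.append(differ / num_pairs)
    return sum(rates) / len(rates)

# Toy margins: heads 0 and 1 always agree, head 2 always disagrees with both.
margins = [
    [0.5, -0.4, 0.3, -0.6],
    [0.7, -0.1, 0.2, -0.2],
    [-0.2, 0.1, -0.3, 0.4],
]
disagreement = mean_pairwise_disagreement(margins)
print(disagreement)
```

A value near zero would indicate the correlated-heads scenario the referee worries about, in which reweighting has little to choose between.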

Circularity Check

0 steps flagged

No significant circularity; method rests on stated empirical observation treated as independent input

The abstract and description present HyRe as building directly on an external empirical observation that multiple heads capture distinct valid interpretations, followed by a Bayesian reweighting step whose justification is the observation itself. No equations, self-citations, or internal derivations are supplied that would reduce the reweighting rule or performance claims to a fit on the target data by construction. The observation is invoked as a premise rather than derived or measured within the same closed loop as the evaluation, satisfying the criterion for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical premise that multiple heads learn meaningfully different interpretations; this premise is not derived from first principles and appears to be validated only on the training distribution used for the reported experiments. No explicit free parameters, axioms, or invented entities are named in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
cs.CL 2025-10 unverdicted novelty 5.0

POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context...

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
[3] Anonymous. Elix: Explain like I'm X - a dataset for personalized explanations. 2024
[4] Anthropic. Claude 3.5 Sonnet. Accessed via Claude.ai, API, and cloud platforms, 2024. URL https://www.anthropic.com. Enhanced reasoning, state-of-the-art coding skills, computer use, and 200K context window. Available on Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI
[5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
[6] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A General Framework for Updating Belief Distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):1103–1130, 02 2016. ISSN 1369-7412. doi:10.1111/rssb.12158
[7] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024
[8] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018
[9] Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, and Chelsea Finn. Persona: A reproducible testbed for pluralistic alignment. arXiv preprint arXiv:2407.17387, 2024
[10] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017
[11] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023
[12] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 23(226):1–61, 2022
[13] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000
[14] Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, and Benjamin Van Roy. Efficient exploration for LLMs. arXiv preprint arXiv:2402.00396, 2024
[15] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pp. 1050–1059. PMLR, 2016
[17] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In International conference on machine learning, pp. 1183–1192. PMLR, 2017
[18] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023
[19] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020
[20] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020
[21] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016
[22] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001, 1990
[23] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011
[24] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International conference on machine learning, pp. 2790–2799. PMLR, 2019
[25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
[26] Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, and Andrew G Wilson. Dangers of Bayesian model averaging under covariate shift. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 3309–3322. Curran Associates, Inc., 2021
[27] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991
[28] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023
[29] Kaixuan Ji, Jiafan He, and Quanquan Gu. Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401, 2024
[30] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024
[31] D. Jimenez. Dynamically weighted ensemble neural networks for classification. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227), volume 1, pp. 753–756 vol.1, 1998. doi:10.1109/IJCNN.1998.682375
[32] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural computation, 6(2):181–214, 1994
[33] Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. UCI machine learning repository. URL https://archive.ics.uci.edu. Accessed October 2024
[34] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023
[35] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pp. 5637–5664. PMLR, 2021
[36] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems, volume 7. MIT Press, 1994
[37] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017
[38] Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. RewardBench: Evaluating reward models for language modeling, 2024
[39] Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. URLB: Unsupervised reinforcement learning benchmark. arXiv preprint arXiv:2110.15191, 2021
[40] Yoonho Lee, Huaxiu Yao, and Chelsea Finn. Diversify and disambiguate: Learning from underspecified data. International Conference on Learning Representations, 2023
[41] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020
[42] Xinyu Li, Zachary C Lipton, and Liu Leqi. Personalized language modeling from personalized human feedback. arXiv preprint arXiv:2402.05133, 2024
[43] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024
[44] William Muldrew, Peter Hayes, Mingtian Zhang, and David Barber. Active preference learning for large language models. arXiv preprint arXiv:2402.08114, 2024
[45] Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, and Benjamin Van Roy. Epistemic neural networks. Advances in Neural Information Processing Systems, 36, 2023
[46] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022
[47] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Inform...
[48] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022
[49]

Personalizing reinforcement learning from human feedback with variational preference learning

Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. arXiv preprint arXiv:2408.10075, 2024

work page arXiv 2024
[50]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[51]

Optimizing ensemble weights and hyperparameters of machine learning models for regression problems

Mohsen Shahhosseini, Guiping Hu, and Hieu Pham. Optimizing ensemble weights and hyperparameters of machine learning models for regression problems. Machine Learning with Applications, 7:100251, 2022

[52]

Do bayesian neural networks need to be fully stochastic?

Mrinank Sharma, Sebastian Farquhar, Eric Nalisnick, and Tom Rainforth. Do bayesian neural networks need to be fully stochastic?, 2023

[53]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

[54]

Distributional preference learning: Understanding and accounting for hidden context in rlhf

Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. Distributional preference learning: Understanding and accounting for hidden context in rlhf. arXiv preprint arXiv:2312.08358, 2023

[55]

Defining and characterizing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460--9471, 2022

[56]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

[57]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

[58]

Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization

Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16761--16772, June 2022

[59]

Probabilistic principal component analysis

Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611--622, 1999

[60]

V.N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988--999, 1999. doi:10.1109/72.788640

[61]

Trl: Transformer reinforcement learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

[62]

A survey of preference-based reinforcement learning methods

Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1--46, 2017

[63]

Reft: Representation finetuning for language models

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Reft: Representation finetuning for language models. arXiv preprint arXiv:2404.03592, 2024

[64]

Regularizing hidden states enables learning generalizable reward model for llms

Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for llms. arXiv preprint arXiv:2406.10216, 2024

[65]

Improving out-of-distribution robustness via selective augmentation

Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp. 25407--25437. PMLR, 2022

[66]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

[67]

Twenty years of mixture of experts

Seniha Esen Yuksel, Joseph N Wilson, and Paul D Gader. Twenty years of mixture of experts. IEEE transactions on neural networks and learning systems, 23(8):1177--1193, 2012

[68]

Consequences of misaligned ai

Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned ai. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 15763--15773. Curran Associates, Inc., 2020

