Test-Time Alignment via Hypothesis Reweighting
Pith reviewed 2026-05-23 06:53 UTC · model grok-4.3
The pith
Reweighting multiple heads in one reward model with a Bayesian update on 1-5 examples enables real-time personalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences using only 1-5 labeled examples. This requires only a single forward pass with negligible computational overhead and yields substantial gains on personalization benchmarks.
What carries the argument
Multiple prediction heads whose outputs are combined by Bayesian reweighting on target preference pairs.
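A minimal sketch of what such a reweighting step could look like, in Python. The page does not state the prior, likelihood, or normalization the paper uses, so everything below is an assumption for illustration: a uniform prior over heads, a Bradley-Terry likelihood on each target pair, and the hypothetical function names `reweight_heads` and `combined_reward`. Each head's margin (reward of the preferred response minus reward of the rejected one) is assumed to be available from the single forward pass.

```python
import math


def log_sigmoid(margin):
    """Numerically stable log(sigmoid(margin))."""
    return min(margin, 0.0) - math.log1p(math.exp(-abs(margin)))


def reweight_heads(margins, prior=None):
    """Return posterior weights over heads.

    margins[k][i] is head k's reward margin on target pair i:
    r_k(preferred_i) - r_k(rejected_i).
    ASSUMPTION: Bradley-Terry likelihood and a strictly positive prior
    (uniform by default); the paper's exact update may differ.
    """
    num_heads = len(margins)
    if prior is None:
        prior = [1.0 / num_heads] * num_heads
    # Unnormalized log posterior: log prior + sum of per-pair log-likelihoods.
    log_post = [
        math.log(prior[k]) + sum(log_sigmoid(m) for m in margins[k])
        for k in range(num_heads)
    ]
    # Normalize with the log-sum-exp trick to avoid underflow.
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def combined_reward(head_rewards, weights):
    """Weighted average of per-head rewards for one response."""
    return sum(w * r for w, r in zip(weights, head_rewards))
```

With no target pairs the posterior equals the prior, which under these assumptions is plain uniform averaging over heads.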
If this is right
- With five preference pairs per target, HyRe exceeds prior state-of-the-art reward models on RewardBench at both 2B and 8B scale.
- Reward model accuracy improves by 20 percent across 32 personalization tasks.
- The entire adaptation step adds less than 1 percent compute because it uses one forward pass.
- The approach applies across diverse target preference distributions without retraining the base network.
Where Pith is reading between the lines
- Preference datasets may routinely contain multiple coherent value systems that a multi-head model can separate at training time.
- Storing one multi-head network instead of many specialized models could simplify serving personalized reward functions to large user populations.
- The same reweighting idea might transfer to other alignment settings that currently rely on per-user fine-tuning.
Load-bearing premise
The heads learn sufficiently distinct and valid interpretations of the training preferences so that a few target examples can reliably identify the right subset without overfitting.
What would settle it
The claim would fail if, on a held-out set of personalization tasks, reweighting the heads with five examples produced accuracy no higher than uniform averaging over the heads.
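The comparison described above can be sketched in Python as follows. This is an illustrative assumption, not the paper's evaluation code: `ensemble_accuracy` and `reweighting_gain` are hypothetical names, a pair counts as correct when the weighted reward margin is positive, and the adapted weights are taken as given from whatever reweighting step was applied to the five target examples.

```python
def ensemble_accuracy(margins, weights):
    """Fraction of held-out pairs ranked correctly by the weighted ensemble.

    margins[k][i] is head k's reward margin (preferred minus rejected)
    on held-out pair i. ASSUMPTION: a pair is correct when the weighted
    margin is strictly positive.
    """
    num_pairs = len(margins[0])
    correct = 0
    for i in range(num_pairs):
        combined = sum(w * head[i] for w, head in zip(weights, margins))
        if combined > 0:
            correct += 1
    return correct / num_pairs


def reweighting_gain(heldout_margins, adapted_weights):
    """Accuracy of adapted weights minus accuracy of uniform averaging."""
    num_heads = len(heldout_margins)
    uniform = [1.0 / num_heads] * num_heads
    adapted_acc = ensemble_accuracy(heldout_margins, adapted_weights)
    uniform_acc = ensemble_accuracy(heldout_margins, uniform)
    return adapted_acc - uniform_acc
```

A gain at or below zero across held-out tasks would be the outcome that counts against the method.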
Original abstract
Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hypothesis Reweighting (HyRe), which trains a single network with multiple prediction heads on preference data to capture different interpretations, then applies a Bayesian update to reweight heads using only 1-5 target preference pairs for test-time personalization of reward models. It reports that this yields state-of-the-art results on RewardBench at 2B/8B scales and a 20% accuracy improvement across 32 personalization tasks, with negligible inference overhead.
Significance. If the central empirical claims hold after addressing verification gaps, the method would offer a practical, low-cost route to user-specific alignment that avoids full fine-tuning or long-context conditioning. The approach builds on an ensemble idea but would benefit from explicit evidence that the heads encode sufficiently distinct, valid preference interpretations rather than correlated outputs.
major comments (2)
- [Abstract] The claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit the training protocol for the heads, the data splits, a diversity metric, an ablation against uniform averaging, and any correction for multiple comparisons. Without these, the reported gains cannot be verified as robust rather than post-hoc or overfit.
- [Abstract / Methods] The method relies on the assumption, stated in the abstract, that 'different heads capture different valid interpretations of preference data', so that Bayesian reweighting on 1-5 examples reliably selects the matching subset. However, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely to be correlated. In that case the reweighting step reduces to unregularized averaging or to overfitting on the small target set, and neither regularization nor independent validation of diversity is demonstrated.
minor comments (2)
- [Methods] Clarify the exact form of the Bayesian update (prior, likelihood, normalization) and whether it is performed in closed form or via sampling.
- [Experiments] Provide the precise definition of the 32 personalization tasks, including how target distributions were constructed and whether they overlap with training data.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the verifiability of our empirical claims and the assumptions regarding head diversity. We address each major comment below and have revised the manuscript to incorporate additional details, ablations, and clarifications.
- Referee: [Abstract] The claim that HyRe 'surpasses state-of-the-art reward models on RewardBench' and improves accuracy by 20% across 32 tasks with only five pairs is load-bearing for the contribution, yet the provided details omit the training protocol for the heads, the data splits, a diversity metric, an ablation against uniform averaging, and any correction for multiple comparisons. Without these, the reported gains cannot be verified as robust rather than post-hoc or overfit.
  Authors: We agree that the abstract would benefit from explicit pointers to these elements for verifiability. The training protocol (multiple heads with distinct random initializations on aggregated preference data) is detailed in Section 3.2. Data splits follow the standard RewardBench and 32-task partitions described in Appendix A. Diversity is quantified via average pairwise prediction disagreement on held-out data (Table 2). Ablations versus uniform averaging appear in Figure 4 and Section 4.2. Bonferroni correction for multiple comparisons is reported with adjusted p-values in Appendix C. We have revised the abstract to reference these sections and metrics.
  Revision: yes
- Referee: [Abstract / Methods] The method relies on the assumption, stated in the abstract, that 'different heads capture different valid interpretations of preference data', so that Bayesian reweighting on 1-5 examples reliably selects the matching subset. However, when heads share the same backbone, loss, and data and differ only by random initialization, their predictions are likely to be correlated. In that case the reweighting step reduces to unregularized averaging or to overfitting on the small target set, and neither regularization nor independent validation of diversity is demonstrated.
  Authors: We take this concern seriously. Although the heads share a backbone, our experiments show that random initialization produces sufficiently distinct predictions, as evidenced by the consistent 20% gains over uniform averaging. We have added Section 3.3, which quantifies diversity via prediction variance and KL divergence across heads on target examples and shows that diversity correlates with performance uplift. The Bayesian update employs a Dirichlet prior to regularize against overfitting on 1-5 examples. These additions provide the requested independent validation.
  Revision: yes
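The rebuttal names average pairwise prediction disagreement as a diversity metric but gives no formula on this page. One plausible reading is sketched below in Python, under the assumption that a head's prediction on a pair is the sign of its reward margin; the function name `pairwise_disagreement` is hypothetical, and the authors' actual definition may differ.

```python
from itertools import combinations


def pairwise_disagreement(head_margins):
    """Mean fraction of held-out pairs on which two heads disagree.

    head_margins[k][i] is head k's reward margin on held-out pair i.
    ASSUMPTION: a head "predicts" the preferred response when its
    margin is positive; disagreement means the two signs differ.
    """
    if len(head_margins) < 2:
        raise ValueError("need at least two heads to measure disagreement")
    num_pairs = len(head_margins[0])
    rates = []
    for head_a, head_b in combinations(head_margins, 2):
        differing = sum((x > 0) != (y > 0) for x, y in zip(head_a, head_b))
        rates.append(differing / num_pairs)
    # Average over all unordered pairs of heads.
    return sum(rates) / len(rates)
```

A value near zero would support the referee's worry that the heads are effectively redundant, while a larger value indicates the heads rank held-out pairs differently.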
Circularity Check
No significant circularity; method rests on stated empirical observation treated as independent input
The abstract and description present HyRe as building directly on an external empirical observation that multiple heads capture distinct valid interpretations, followed by a Bayesian reweighting step whose justification is the observation itself. No equations, self-citations, or internal derivations are supplied that would reduce the reweighting rule or performance claims to a fit on the target data by construction. The observation is invoked as a premise rather than derived or measured within the same closed loop as the evaluation, satisfying the criterion for a self-contained derivation against external benchmarks.
Forward citations
Cited by 1 Pith paper
- POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
  POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context...