One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Fengbin Zhu; Fuli Feng; Hongru Cai; Tiezheng Yu; Wenjie Li; Wenjie Wang; Yongqi Li

arxiv: 2601.18731 · v2 · submitted 2026-01-26 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 10:55 UTC · model grok-4.3

keywords meta reward modeling · personalized LLM alignment · few-shot personalization · reward modeling · MAML adaptation · user robustness · preference adaptation

The pith

Meta Reward Modeling adapts LLM reward functions to new users with just a few feedback examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard reward modeling struggles with personalized LLM alignment because each user provides too little feedback for reliable training. Instead of fitting a separate model per user, it reframes the task as learning the adaptation process itself through meta-learning. By representing every user's reward model as a weighted combination of shared base functions and meta-optimizing the starting weights, the system can adjust quickly when new feedback arrives. A Robust Personalization Objective further tilts training toward users whose preferences are hardest to capture. If the approach holds, personalized alignment becomes feasible at scale because the data burden per individual drops sharply.

Core claim

By casting personalized reward modeling as a meta-learning problem, MRM learns an initialization for the weights of base reward functions so that a small amount of new user feedback suffices to produce an effective individualized reward model. The method employs a MAML-style outer loop to optimize this initialization across many users and introduces the Robust Personalization Objective to emphasize hard-to-model users during meta-training, yielding consistent gains over non-meta baselines on personalized preference datasets.

What carries the argument

Meta Reward Modeling (MRM): a MAML-style meta-optimization that learns the initial weights of a linear combination of base reward functions so few-shot adaptation to an unseen user becomes reliable.
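The mechanism can be sketched in a few dozen lines. This is an illustrative toy under stated assumptions, not the authors' released implementation: it assumes a pairwise logistic (Bradley-Terry style) loss, represents each preference pair by the vector of base-reward differences between the chosen and rejected response, and swaps in a first-order approximation of the MAML outer update to avoid second derivatives. All function names (`bt_loss_and_grad`, `adapt`, `meta_train`, `make_user`) and the synthetic users are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_loss_and_grad(w, pairs):
    """Mean pairwise logistic loss and its gradient with respect to w.

    Each pair is a vector d with d[i] = r_i(chosen) - r_i(rejected), the
    difference of the i-th base reward between the two responses (assumed setup).
    """
    loss, grad = 0.0, [0.0] * len(w)
    for d in pairs:
        p = sigmoid(sum(wi * di for wi, di in zip(w, d)))
        loss -= math.log(max(p, 1e-12))
        for i, di in enumerate(d):
            grad[i] -= (1.0 - p) * di
    n = len(pairs)
    return loss / n, [g / n for g in grad]


def adapt(w0, support, inner_lr=0.5, steps=3):
    """Inner loop: a few gradient steps from the shared initialization."""
    w = list(w0)
    for _ in range(steps):
        _, g = bt_loss_and_grad(w, support)
        w = [wi - inner_lr * gi for wi, gi in zip(w, g)]
    return w


def meta_train(users, n_bases, outer_lr=0.1, epochs=200, batch_size=4, seed=0):
    """Outer loop (first-order approximation): update w0 with query-set
    gradients evaluated at each sampled user's adapted weights."""
    rng = random.Random(seed)
    w0 = [0.0] * n_bases
    for _ in range(epochs):
        batch = rng.sample(users, min(batch_size, len(users)))
        meta_grad = [0.0] * n_bases
        for support, query in batch:
            _, g = bt_loss_and_grad(adapt(w0, support), query)
            meta_grad = [m + gi for m, gi in zip(meta_grad, g)]
        w0 = [wi - outer_lr * m / len(batch) for wi, m in zip(w0, meta_grad)]
    return w0


def make_user(true_w, n_pairs, rng):
    """Synthetic user: pairs are oriented so the true weights prefer 'chosen'."""
    pairs = []
    for _ in range(n_pairs):
        d = [rng.gauss(0.0, 1.0) for _ in true_w]
        if sum(t * x for t, x in zip(true_w, d)) < 0:
            d = [-x for x in d]
        pairs.append(d)
    return pairs


if __name__ == "__main__":
    rng = random.Random(1)
    center = [1.0, -1.0, 0.5]  # hypothetical population-level preference
    users = []
    for _ in range(20):
        true_w = [c + rng.gauss(0.0, 0.3) for c in center]
        pairs = make_user(true_w, 15, rng)
        users.append((pairs[:5], pairs[5:]))  # (support, query)
    print("meta-learned initialization:", meta_train(users, n_bases=3))
```

Because only the combination weights are adapted, the per-user state in this sketch is a single short vector; the base reward functions themselves stay fixed and shared.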

If this is right

Personalized reward models can be produced with far fewer user-specific labels than current per-user fitting requires.
Adaptation performance becomes more stable across users whose preferences deviate from the majority.
The same meta-initialization supports rapid switching between different base reward architectures without retraining from scratch.
Overall training cost for maintaining a fleet of personalized models decreases because most computation occurs once in the meta phase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same meta-initialization trick could be applied to other alignment modules such as safety classifiers or response-style adapters.
Production systems could maintain a single meta-model and spin up per-user versions on demand with minimal storage overhead.
Future work might combine MRM with online user modeling so that the base functions themselves evolve as new preference patterns emerge across the population.

Load-bearing premise

User preferences can be expressed well enough as a weighted sum of a fixed collection of base reward functions whose starting weights can be meta-learned to work for future unseen users.

What would settle it

A controlled test in which, for held-out users, a reward model trained from the meta-learned initialization with k feedback examples performs no better than an identical model trained from random weights with the same k examples.
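A rough harness for this test might look as follows. It is a sketch on synthetic data, not an evaluation of the paper: the pairwise logistic loss, the helper names (`pairwise_accuracy`, `few_shot_adapt`, `compare_inits`, `synthetic_users`), and the stand-in value passed as `w_meta` are all assumptions. In a real run, `w_meta` would come from meta-training and `held_out_users` from users never seen during it.

```python
import math
import random


def pairwise_accuracy(w, pairs):
    """Fraction of pairs whose weighted base-reward difference favors 'chosen'."""
    hits = sum(1 for d in pairs if sum(wi * di for wi, di in zip(w, d)) > 0)
    return hits / len(pairs)


def few_shot_adapt(w_init, support, lr=0.5, steps=3):
    """Identical adaptation procedure for both arms: a few gradient steps on a
    pairwise logistic loss (an assumed loss, for illustration)."""
    w = list(w_init)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for d in support:
            margin = sum(wi * di for wi, di in zip(w, d))
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            coeff = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for i, di in enumerate(d):
                grad[i] += coeff * di / len(support)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def compare_inits(held_out_users, w_meta, k, n_random=20, seed=0):
    """Mean held-out accuracy after k-shot adaptation from (a) the meta-learned
    initialization and (b) random initializations, using the same k examples."""
    rng = random.Random(seed)
    meta_scores, random_scores = [], []
    for pairs in held_out_users:
        support, query = pairs[:k], pairs[k:]
        meta_scores.append(pairwise_accuracy(few_shot_adapt(w_meta, support), query))
        accs = []
        for _ in range(n_random):
            w_rand = [rng.gauss(0.0, 1.0) for _ in w_meta]
            accs.append(pairwise_accuracy(few_shot_adapt(w_rand, support), query))
        random_scores.append(sum(accs) / len(accs))
    n = len(held_out_users)
    return sum(meta_scores) / n, sum(random_scores) / n


def synthetic_users(center, n_users, n_pairs, seed=0):
    """Toy users whose true weights scatter around a shared center."""
    rng = random.Random(seed)
    users = []
    for _ in range(n_users):
        true_w = [c + rng.gauss(0.0, 0.3) for c in center]
        pairs = []
        for _ in range(n_pairs):
            d = [rng.gauss(0.0, 1.0) for _ in center]
            if sum(t * x for t, x in zip(true_w, d)) < 0:
                d = [-x for x in d]
            pairs.append(d)
        users.append(pairs)
    return users


if __name__ == "__main__":
    users = synthetic_users([1.0, -1.0, 0.5], n_users=10, n_pairs=30, seed=3)
    # Stand-in for a meta-learned initialization (here: the population center).
    meta_acc, rand_acc = compare_inits(users, w_meta=[1.0, -1.0, 0.5], k=2)
    print(f"meta init: {meta_acc:.3f}  random init: {rand_acc:.3f}")
```

The central claim would be in trouble if, on real held-out users, the two returned numbers were statistically indistinguishable at small k.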

Original abstract

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learn the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines. We release code at https://github.com/ModalityDance/MRM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MRM reframes personalized reward modeling as MAML-style meta-learning over linear combinations of base rewards plus a robust objective, which is a clean new angle but rests on an unproven span assumption.

The main point is that this paper recasts the problem of building user-specific reward models as a meta-learning task. They treat each user's reward as a weighted sum of shared base reward functions, then use a MAML-style outer loop to learn good initial weights so that a few gradient steps on new user data produce a decent personalized model. The Robust Personalization Objective adds a focus on hard-to-fit users during meta-training. That formulation is the actual novelty relative to standard reward modeling or simple per-user fine-tuning, and the released code is a plus for reproducibility.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Meta Reward Modeling (MRM) to address data scarcity and adaptation challenges in personalized LLM alignment. It models each user's reward function as a linear combination of base reward functions and uses a MAML-style outer loop to meta-optimize the weight initializations for fast few-shot adaptation to unseen users. A Robust Personalization Objective (RPO) is introduced to emphasize hard-to-learn users during meta-training. Experiments on personalized preference datasets are reported to show that MRM improves few-shot personalization performance and user robustness while outperforming baselines.

Significance. If the empirical claims hold after detailed verification, the work could meaningfully advance scalable personalized alignment by shifting focus from per-user fitting to learning the adaptation process itself. The meta-learning framing and code release support reproducibility and potential follow-up. However, significance hinges on whether the linear-span assumption for user preferences is sufficiently expressive in practice and whether reported gains survive stronger controls for base-function construction and statistical robustness.

major comments (3)

[Method] Method section (MRM formulation): the central modeling choice r_u = sum w_i * r_base_i assumes user preferences lie (approximately) in the linear span of the chosen bases. No analysis, construction details for the bases, or empirical test is provided showing that this span is rich enough to cover non-linear or feature-interacting preferences; this assumption is load-bearing for the few-shot adaptation claim.
[Experiments] Experiments section: the abstract states consistent outperformance and improved robustness, yet no details appear on base-function construction, exact metrics (e.g., reward accuracy vs. win rate), number of adaptation shots, variance across runs, or statistical significance tests. Without these, it is impossible to judge whether gains are robust or attributable to MRM rather than hyper-parameter choices.
[Method] RPO description: the Robust Personalization Objective is presented as improving robustness to hard users, but no ablation isolating RPO from the base MAML procedure is reported. This leaves unclear whether the robustness gains are due to RPO or simply the meta-learning framework.
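For reference, the formulation these comments probe can be written out explicitly. The first line restates the linear combination quoted in the first comment; the remaining lines are the standard MAML inner and outer updates applied to the weights, with a single inner step shown for brevity. The symbols K, α, the support/query split, and the pairwise logistic loss in the last line are assumed notation for illustration, not taken from the manuscript, and the RPO reweighting is deliberately left out because its exact form is not given here.

```latex
% Assumed notation: K base rewards, inner step size \alpha,
% support/query losses per user u, and meta-training users U_train.
\begin{align}
r_u(x, y) &= \sum_{i=1}^{K} w_{u,i}\, r^{\mathrm{base}}_i(x, y) \\
w_u &= w_0 - \alpha \, \nabla_{w} \mathcal{L}^{\mathrm{sup}}_u(w) \Big|_{w = w_0} \\
w_0^{\star} &= \arg\min_{w_0} \sum_{u \in \mathcal{U}_{\mathrm{train}}} \mathcal{L}^{\mathrm{qry}}_u(w_u) \\
\mathcal{L}_u(w) &= -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim u}
  \left[ \log \sigma\!\left( \sum_{i=1}^{K} w_i \left( r^{\mathrm{base}}_i(x, y^{+}) - r^{\mathrm{base}}_i(x, y^{-}) \right) \right) \right]
\end{align}
```

Written this way, the first major comment amounts to asking whether the first line is expressive enough: no choice of w_u can represent a preference that depends on interactions between base rewards.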

minor comments (2)

[Abstract] Abstract: the number of base reward functions and the concrete datasets used should be stated explicitly to give readers an immediate sense of experimental scope.
[Method] Notation: ensure the distinction between meta-training and meta-testing users is introduced with consistent symbols before the first equations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, provide missing details, and strengthen the empirical support.

Referee: [Method] Method section (MRM formulation): the central modeling choice r_u = sum w_i * r_base_i assumes user preferences lie (approximately) in the linear span of the chosen bases. No analysis, construction details for the bases, or empirical test is provided showing that this span is rich enough to cover non-linear or feature-interacting preferences; this assumption is load-bearing for the few-shot adaptation claim.

Authors: We appreciate the referee highlighting the importance of the linear-span assumption. The formulation uses a linear combination to enable tractable meta-optimization of the weight initializations via MAML. In the manuscript the base functions are constructed as a diverse collection of reward heads trained on clustered subsets of public preference data to capture varied preference dimensions. To directly address the concern we will add a dedicated subsection in Methods that (i) details the base-construction procedure, (ii) provides a brief theoretical motivation for why the linear span can approximate a wide class of preference functions when the bases are sufficiently diverse, and (iii) reports an auxiliary experiment on synthetic non-linear preference data demonstrating that the span covers the majority of test cases with low reconstruction error. These additions will make the modeling choice and its practical scope explicit.

Revision: yes

Referee: [Experiments] Experiments section: the abstract states consistent outperformance and improved robustness, yet no details appear on base-function construction, exact metrics (e.g., reward accuracy vs. win rate), number of adaptation shots, variance across runs, or statistical significance tests. Without these, it is impossible to judge whether gains are robust or attributable to MRM rather than hyper-parameter choices.

Authors: We agree that the current Experiments section lacks sufficient implementation and statistical detail. We will expand it to explicitly describe base-function construction (the same procedure referenced above), the primary evaluation metric (win rate on held-out user preferences, with reward-model accuracy reported as a secondary metric), the adaptation-shot regimes (1-shot, 5-shot, and 10-shot), performance variance (standard deviation across five independent random seeds), and statistical significance (paired t-tests against each baseline with p-values). These revisions will allow readers to verify that the reported improvements are attributable to MRM rather than hyper-parameter tuning.

Revision: yes

Referee: [Method] RPO description: the Robust Personalization Objective is presented as improving robustness to hard users, but no ablation isolating RPO from the base MAML procedure is reported. This leaves unclear whether the robustness gains are due to RPO or simply the meta-learning framework.

Authors: We acknowledge that an ablation isolating RPO is necessary. We will add a new ablation subsection in Experiments that compares the full MRM objective against a standard MAML baseline without the robust weighting term. The comparison will report both average and worst-case personalization performance across users, thereby quantifying the incremental benefit of RPO for robustness.

Revision: yes
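The quantities promised in these responses are simple to compute, and a small sketch makes the proposed ablation concrete. Everything here is illustrative: `robust_user_weights` uses a softmax over per-user losses as one generic way to emphasize hard users and is not claimed to be the paper's RPO; `summarize_users` and `paired_t_statistic` are hypothetical helper names; and the numbers in the usage section are toy inputs, not results from the paper.

```python
import math
import statistics


def robust_user_weights(user_losses, temperature=1.0):
    """Softmax weights over per-user losses: higher loss, higher weight.

    A large temperature approaches uniform averaging (plain meta-learning);
    a small one approaches a worst-case objective. Illustrative stand-in only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(user_losses)
    exps = [math.exp((loss - top) / temperature) for loss in user_losses]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_users(per_user_scores):
    """Average and worst-case score across users, as the ablation would report."""
    return {"mean": statistics.mean(per_user_scores), "worst": min(per_user_scores)}


def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. two methods run on the same seeds."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Toy per-user losses: the third user is the hardest to fit.
    print(robust_user_weights([0.2, 0.7, 1.5], temperature=0.5))
    # Toy per-user accuracies for one method.
    print(summarize_users([0.9, 0.6, 0.8]))
    # Toy accuracies over five seeds for two methods (not from the paper).
    with_rpo = [0.80, 0.82, 0.79, 0.81, 0.83]
    without_rpo = [0.75, 0.78, 0.76, 0.77, 0.78]
    print(paired_t_statistic(with_rpo, without_rpo))
```

Turning the t statistic into a p-value additionally needs the t-distribution CDF with n - 1 degrees of freedom; SciPy's `scipy.stats.ttest_rel` is one existing routine that returns both.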

Circularity Check

0 steps flagged

No significant circularity: meta-learning procedure is independent of fitted inputs

The derivation presents MRM as a distinct MAML-style outer-loop optimization over initializations of weights in a linear combination of base reward functions. This is a new training procedure for fast adaptation rather than a re-expression or renaming of any fitted parameter or input data. No self-citations are invoked as load-bearing uniqueness theorems, no predictions reduce by construction to the meta-training set, and the central claims rest on empirical outperformance on held-out users. The formulation is self-contained against external benchmarks and does not collapse to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only; free parameters and axioms cannot be exhaustively audited without the methods section. The approach implicitly assumes linear combination of base rewards and meta-generalization across users.

free parameters (1)

meta-learning hyperparameters
MAML-style inner and outer learning rates and adaptation steps are standard free parameters whose specific values are not stated in the abstract.

axioms (1)

domain assumption: User preferences can be represented as weighted combinations of shared base reward functions
Stated in the abstract as the representation used for each user's reward model.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages

[1]

Aligning large language models with human: A survey, 2023

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey, 2023

work page 2023
[2]

A survey of reinforcement learning from human feedback.Transactions on Machine Learning Research, 2025

Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human feedback.Transactions on Machine Learning Research, 2025. ISSN 2835-8856

work page 2025
[3]

A survey on personalized Alignment— The missing piece for large language models in real- world applications

Jian Guan, Junfei Wu, Jia-Nan Li, Chuanqi Cheng, and Wei Wu. A survey on personalized Alignment— The missing piece for large language models in real- world applications. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025
[4]

Posi- tion: a roadmap to pluralistic alignment

Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christo- pher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. Posi- tion: a roadmap to pluralistic alignment. InProceed- ings of the 41st International Conference on Machine Learning, ICML’24, 2024. 13 One Adapts to Any: Meta...

work page 2024
[5]

Personalization of large language models: A survey

ZhehaoZhang,RyanA.Rossi,BranislavKveton,Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernon- court, Joe Barrow, Tong Yu, Sungchul Kim, et al. Personalization of large language models: A survey. Transactions on Machine Learning Research, 2025. Survey Certification

work page 2025
[6]

Christiano, Jan Leike, Tom B

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep rein- forcement learning from human preferences. InPro- ceedings of the 31st International Conference on Neu- ral Information Processing Systems, NIPS’17, 2017

work page 2017
[7]

Training a helpful and harmless assistant with rein- forcement learning from human feedback, 2022

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with rein- forcement learning from human feedback, 2022

work page 2022
[8]

Traininglanguagemodelstofollowinstructionswith human feedback.Advances in neural information processing systems, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Traininglanguagemodelstofollowinstructionswith human feedback.Advances in neural information processing systems, 2022

work page 2022
[9]

Towards harmless multimodal assistants with blind preference optimization, 2025

Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, and Liqiang Nie. Towards harmless multimodal assistants with blind preference optimization, 2025

work page 2025
[10]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Pro- cessing Systems, 2023

work page 2023
[11]

Personalizing reinforcement learning from human feedback with variational preference learning

Sriyash Poddar, Yanming Wan, Hamish Ivison, Ab- hishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[12]

PAL: Sample-efficient per- sonalized reward modeling for pluralistic alignment

Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, and Ramya Korlakai Vinayak. PAL: Sample-efficient per- sonalized reward modeling for pluralistic alignment. InThe Thirteenth International Conference on Learn- ing Representations, 2025

work page 2025
[13]

Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, and Diyi Yang

Michael J. Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, and Diyi Yang. Syn- thesizeMe! inducing persona-guided prompts for personalized reward models in LLMs. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

work page 2025
[14]

Skywork-reward-v2: Scaling preference data curation via human-ai synergy, 2025

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy, 2025

work page 2025
[15]

Lore: Personalizing llms via low-rank reward modeling, 2025

Avinandan Bose, Zhihan Xiong, Yuejie Chi, Si- mon Shaolei Du, Lin Xiao, and Maryam Fazel. Lore: Personalizing llms via low-rank reward modeling, 2025

work page 2025
[16]

Guided profile generation improves personalization with large language models

Jiarui Zhang. Guided profile generation improves personalization with large language models. InFind- ings of the Association for Computational Linguistics: EMNLP 2024, 2024

work page 2024
[17]

Group preference optimization: Few-shot alignment of large language models

Siyan Zhao, John Dang, and Aditya Grover. Group preference optimization: Few-shot alignment of large language models. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[18]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. InThirty- seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[19]

Personalized soups: Personalized large language model alignment via post-hoc parameter merging, 2023

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging, 2023

work page 2023
[20]

Language model personalization via reward factorization

Idan Shenfeld, Felix Faltings, Pulkit Agrawal, and Aldo Pacchiano. Language model personalization via reward factorization. InSecond Conference on Language Modeling, 2025

work page 2025
[21]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. InProceedings of the 34th Inter- national Conference on Machine Learning - Volume 70, ICML’17, 2017

work page 2017
[22]

A comprehensive survey of reward models: Taxonomy, applications, challenges, and future, 2025

Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, HuaLu, YichengChen, YangZhang, WeiZhou, Jinjie Gu, and Lei Zou. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future, 2025

work page 2025
[23]

Internlm2 techni- cal report, 2024

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, et al. Internlm2 techni- cal report, 2024

work page 2024
[24]

Advancing LLM reasoning generalists withpreferencetrees

Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Boji Shan, Zeyuan Liu, Jia Deng, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing LLM reasoning generalists withpreferencetrees. InTheThirteenthInternational Conference on Learning Representations, 2025

work page 2025
[25]

Skywork-reward: Bag of tricks for reward modeling in llms, 2024

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, 14 One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms, 2024

work page 2024
[26]

Interpretable preferences via multi-objective reward modeling and mixture-of- experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of- experts. InFindings of the Association for Computa- tional Linguistics: EMNLP 2024, 2024

work page 2024
[27]

Xing, Hao Zhang, JosephE.Gonzalez,andIonStoica

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, JosephE.Gonzalez,andIonStoica. Judgingllm-as-a- judge with mt-bench and chatbot arena. InProceed- ings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, 2023

work page 2023
[28]

Generative judge for evaluating alignment

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, hai zhao, and Pengfei Liu. Generative judge for evaluating alignment. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[29]

Compassjudger-1: All-in-one judge model helps model evaluation and evolution, 2024

Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, and Kai Chen. Compassjudger-1: All-in-one judge model helps model evaluation and evolution, 2024

work page 2024
[30]

Learn- ing LLM-as-a-judge for preference alignment

Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun LIU. Learn- ing LLM-as-a-judge for preference alignment. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[31]

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khal- man, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feed- back, 2023

work page 2023
[32]

From $r$ to $q^*$: Your language model is secretlyaq-function

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From $r$ to $q^*$: Your language model is secretlyaq-function. InFirstConferenceonLanguage Modeling, 2024

work page 2024
[33]

A general theo- retical paradigm to understand learning from hu- man preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theo- retical paradigm to understand learning from hu- man preferences. InProceedings of The 27th In- ternational Conference on Artificial Intelligence and Statistics, 2024

work page 2024
[34]

Model alignment as prospecttheoreticoptimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospecttheoreticoptimization. InProceedingsofthe 41st International Conference on Machine Learning, ICML’24, 2024

work page 2024
[35]

Starling-7b: Improving helpfulness and harmlessness with RLAIF

Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, Karthik Ganesan, Wei-Lin Chiang, Jian Zhang, and Jiantao Jiao. Starling-7b: Improving helpfulness and harmlessness with RLAIF. InFirst Conference on Language Modeling, 2024

work page 2024
[36]

Regularizing hidden states en- ables learning generalizable reward model for LLMs

Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states en- ables learning generalizable reward model for LLMs. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[37]

Making language models better reasoners with step-aware verifier

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. InProceedings of the 61st Annual Meeting of theAssociationforComputationalLinguistics(Volume 1: Long Papers), 2023

work page 2023
[38]

Francis Song, Noah Yamamoto Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins

Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Yamamoto Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solvingmathwordproblemswithprocess-basedand outcome-based feedback, 2023

work page 2023
[39]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Har- rison Edwards, Bowen Baker, Teddy Lee, Jan Leike, JohnSchulman,IlyaSutskever,andKarlCobbe. Let’s verify step by step. InThe Twelfth International Con- ference on Learning Representations, 2024

work page 2024
[40]

Math-shepherd: Verify and reinforce LLMs step-by- step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by- step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

work page 2024
[41]

Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision

ZihanWang,YunxuanLi,YuexinWu,LiangchenLuo, Le Hou, Hongkun Yu, and Jingbo Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. In Findings of the Association for Computational Linguis- tics: EMNLP 2024, 2024

work page 2024
[42]

Two tales of persona in LLMs: A survey of role-playing and personalization

Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun- Nung Chen. Two tales of persona in LLMs: A survey of role-playing and personalization. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

work page 2024
[43]

Kummerfeld, Veronica Perez-Rosas, and Rada Mihalcea

Charles Welch, Chenxi Gu, Jonathan K. Kummerfeld, Veronica Perez-Rosas, and Rada Mihalcea. Lever- aging similar users for personalized language mod- eling with limited data. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

work page 2022
[44]

Membership inference attacks against machine learning models

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In2017 IEEE symposium on security and privacy (SP), 2017

work page 2017
[45]

Evaluating approaches to personalizing language models

Milton King and Paul Cook. Evaluating approaches to personalizing language models. InProceedings of the Twelfth Language Resources and Evaluation Conference, 2020. 15 One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

work page 2020
[46]

Andrew Schwartz

Nikita Soni, Matthew Matero, Niranjan Balasubra- manian, and H. Andrew Schwartz. Human language modeling. InFindings of the Association for Compu- tational Linguistics: ACL 2022, 2022

work page 2022
[47]

UserIdentifier: Im- plicit user representations for simple and effective personalized sentiment analysis

Fatemehsadat Mireshghallah, Vaishnavi Shrivastava, Milad Shokouhi, Taylor Berg-Kirkpatrick, Robert Sim, and Dimitrios Dimitriadis. UserIdentifier: Im- plicit user representations for simple and effective personalized sentiment analysis. InProceedings of the 2022 Conference of the North American Chapter of theAssociationforComputationalLinguistics: Human L...

work page 2022
[48]

Large language models empowered personalized web agents

Hongru Cai, Yongqi Li, Wenjie Wang, Fengbin Zhu, Xiaoyu Shen, Wenjie Li, and Tat-Seng Chua. Large language models empowered personalized web agents. InProceedings of the ACM on Web Conference 2025, WWW ’25, 2025

work page 2025
[49]

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personaliz- ing dialogue agents: I have a dog, do you have pets too? InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018

work page 2018
[50]

Training millions of personalized dialogue agents

Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

work page 2018
[51]

Less is more: Learning to refine dialogue history for personalized dialogue generation

Hanxun Zhong, Zhicheng Dou, Yutao Zhu, Hongjin Qian, and Ji-Rong Wen. Less is more: Learning to refine dialogue history for personalized dialogue generation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, 2022

work page 2022
[52]

Health-llm: Personalized retrieval- augmented disease prediction system, 2025

Qinkai Yu, Mingyu Jin, Dong Shu, Chong Zhang, Lizhou Fan, Wenyue Hua, Suiyuan Zhu, Yanda Meng, Zhenting Wang, Mengnan Du, and Yongfeng Zhang. Health-llm: Personalized retrieval- augmented disease prediction system, 2025

work page 2025
[53]

Enhancing video- based learning using knowledge tracing: Personaliz- ing students’ learning experience with ORBITS

Shady Shehata, David Santandreu Calonge, Philip Purnell, and Mark Thompson. Enhancing video- based learning using knowledge tracing: Personaliz- ing students’ learning experience with ORBITS. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023

work page 2023
[54]

Jimmy Wu, Rika Antonova, Adam Kan, Marion Lepert, Andy Zeng, Shuran Song, Jeannette Bohg, Szymon Rusinkiewicz, and Thomas Funkhouser. Tidybot: personalized robot assistance with large language models. Auton. Robots, 2023

[55]

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, 2016

[56]

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017

[57]

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018

[58]

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

[59]

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

[60]

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning, 2016

[61]

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn, 2017

[62]

Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

[63]

Ruofan Wang, Prakruthi Prabhakar, Gaurav Srivastava, Tianqi Wang, Zeinab S. Jalali, Varun Bharill, Yunbo Ouyang, Aastha Nigam, Divya Venugopalan, Aman Gupta, Fedor Borisyuk, Sathiya Keerthi, and Ajith Muralidharan. Limaml: Personalization of deep recommender models via meta learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discover...

[64]

Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, and Hugo Larochelle. A meta-learning perspective on cold-start recommendations for items. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017

[65]

Zhenchao Wu and Xiao Zhou. M2eu: Meta learning for cold-start recommendation via enhancing user preference estimation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, 2023

[66]

Minchang Kim, Yongjin Yang, Jung Hyun Ryu, and Taesup Kim. Meta-learning with adaptive weighted loss for imbalanced cold-start recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, 2023

[67]

Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 2019

[68]

Thomas P Zollo, Andrew Wei Tung Siah, Naimeng Ye, Ang Li, and Hongseok Namkoong. PersonalLLM: Tailoring LLMs to individual preferences. In The Thirteenth International Conference on Learning Representations, 2025

[69]

Yushang Zhao, Huijie Shen, Dannier Li, Lu Chang, Chengrui Zhou, and Yinuo Yang. Meta-learning for cold-start personalization in prompt-tuned llms, 2025

[70]

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, and Chelsea Finn. FSPO: Few-shot preference optimization of synthetic preference data elicits LLM personalization to real users. In 2nd Workshop on Models of Human Feedback for AI Alignment, 2025

[71]

Sungyong Baik, Janghoon Choi, Heewon Kim, Dohee Cho, Jaesik Min, and Kyoung Mu Lee. Meta-learning with task-adaptive loss function for few-shot learning. In Proceedings of the IEEE/CVF international conference on computer vision, 2021

[72]

Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti, Gaurav Sukhatme, and Franziska Meier. Meta learning via learned loss. In International Conference on Pattern Recognition, ICPR, Italy, January 10-15, 2021, 2021

[73]

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

[74]

Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Michael Bean, Katerina Margatina, Rafael Mosquera, Juan Manuel Ciro, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignmen...

[75]

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, 2020

[76]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, et al. The llama 3 herd of models, 2024

[77]

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2015

[1]

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey, 2023

[2]

Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human feedback. Transactions on Machine Learning Research, 2025. ISSN 2835-8856

[3]

Jian Guan, Junfei Wu, Jia-Nan Li, Chuanqi Cheng, and Wei Wu. A survey on personalized Alignment—The missing piece for large language models in real-world applications. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

[4]

Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. Position: a roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024

[5]

Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, et al. Personalization of large language models: A survey. Transactions on Machine Learning Research, 2025. Survey Certification

[6]

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 2017

[7]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

[8]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022

[9]

Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, and Liqiang Nie. Towards harmless multimodal assistants with blind preference optimization, 2025

[10]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[11]

Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[12]

Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, and Ramya Korlakai Vinayak. PAL: Sample-efficient personalized reward modeling for pluralistic alignment. In The Thirteenth International Conference on Learning Representations, 2025

[13]

Michael J. Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, and Diyi Yang. SynthesizeMe! inducing persona-guided prompts for personalized reward models in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

[14]

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy, 2025

[15]

Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, and Maryam Fazel. Lore: Personalizing llms via low-rank reward modeling, 2025

[16]

Jiarui Zhang. Guided profile generation improves personalization with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

[17]

Siyan Zhao, John Dang, and Aditya Grover. Group preference optimization: Few-shot alignment of large language models. In The Twelfth International Conference on Learning Representations, 2024

[18]

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[19]

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging, 2023

[20]

Idan Shenfeld, Felix Faltings, Pulkit Agrawal, and Aldo Pacchiano. Language model personalization via reward factorization. In Second Conference on Language Modeling, 2025

[21]

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, 2017

[22]

Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, and Lei Zou. A comprehensive survey of reward models: Taxonomy, applications, challenges, and future, 2025

[23]

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, et al. Internlm2 technical report, 2024

[24]

Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Boji Shan, Zeyuan Liu, Jia Deng, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing LLM reasoning generalists with preference trees. In The Thirteenth International Conference on Learning Representations, 2025

[25]

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms, 2024

[26]

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

[27]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, 2023

[28]

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment. In The Twelfth International Conference on Learning Representations, 2024

[29]

Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, and Kai Chen. Compassjudger-1: All-in-one judge model helps model evaluation and evolution, 2024

[30]

Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Learning LLM-as-a-judge for preference alignment. In The Thirteenth International Conference on Learning Representations, 2025

[31]

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023

[32]

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From $r$ to $q^*$: Your language model is secretly a q-function. In First Conference on Language Modeling, 2024

[33]

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, 2024

[34]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024

[35]

Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, Karthik Ganesan, Wei-Lin Chiang, Jian Zhang, and Jiantao Jiao. Starling-7b: Improving helpfulness and harmlessness with RLAIF. In First Conference on Language Modeling, 2024

[36]

Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for LLMs. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[37]

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

[38]

Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Yamamoto Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-based and outcome-based feedback, 2023

[39]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024

[40]

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

[41]

Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

[42]

Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. Two tales of persona in LLMs: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

[43]

Charles Welch, Chenxi Gu, Jonathan K. Kummerfeld, Veronica Perez-Rosas, and Rada Mihalcea. Leveraging similar users for personalized language modeling with limited data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

[44]

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), 2017

[45]

Milton King and Paul Cook. Evaluating approaches to personalizing language models. In Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020

[46]

Nikita Soni, Matthew Matero, Niranjan Balasubramanian, and H. Andrew Schwartz. Human language modeling. In Findings of the Association for Computational Linguistics: ACL 2022, 2022

[47]

Fatemehsadat Mireshghallah, Vaishnavi Shrivastava, Milad Shokouhi, Taylor Berg-Kirkpatrick, Robert Sim, and Dimitrios Dimitriadis. UserIdentifier: Implicit user representations for simple and effective personalized sentiment analysis. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...
