One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment
Pith reviewed 2026-05-16 10:55 UTC · model grok-4.3
The pith
Meta Reward Modeling adapts LLM reward functions to new users with just a few feedback examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting personalized reward modeling as a meta-learning problem, MRM learns an initialization for the weights of base reward functions so that a small amount of new user feedback suffices to produce an effective individualized reward model. The method employs a MAML-style outer loop to optimize this initialization across many users and introduces the Robust Personalization Objective to emphasize hard-to-model users during meta-training, yielding consistent gains over non-meta baselines on personalized preference datasets.
What carries the argument
Meta Reward Modeling (MRM): a MAML-style meta-optimization that learns the initial weights of a linear combination of base reward functions so few-shot adaptation to an unseen user becomes reliable.
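The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Bradley-Terry (logistic) preference loss, represents each preference pair by the difference of base-reward scores between the chosen and rejected response, and uses the first-order approximation of MAML for brevity. All function names, learning rates, and step counts are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss_and_grad(w, pairs):
    """Mean logistic preference loss and its gradient for a linear reward.

    Each pair is a vector d = r_base(chosen) - r_base(rejected), so the
    user's reward margin is sum_i w[i] * d[i]. (Assumed data layout.)
    """
    loss, grad = 0.0, [0.0] * len(w)
    for d in pairs:
        p = sigmoid(sum(wi * di for wi, di in zip(w, d)))
        loss += -math.log(max(p, 1e-12))
        for i, di in enumerate(d):
            grad[i] += -(1.0 - p) * di
    n = len(pairs)
    return loss / n, [g / n for g in grad]

def adapt(w0, support, inner_lr=0.5, steps=1):
    """Inner loop: a few gradient steps on one user's support pairs."""
    w = list(w0)
    for _ in range(steps):
        _, g = loss_and_grad(w, support)
        w = [wi - inner_lr * gi for wi, gi in zip(w, g)]
    return w

def meta_train(users, dim, outer_lr=0.1, epochs=200, seed=0):
    """Outer loop (first-order MAML): users is a list of (support, query)."""
    rng = random.Random(seed)
    w0 = [0.0] * dim
    for _ in range(epochs):
        support, query = rng.choice(users)
        w_user = adapt(w0, support)
        # First-order shortcut: use the query gradient at the adapted
        # weights and ignore second-order terms through the inner step.
        _, g = loss_and_grad(w_user, query)
        w0 = [wi - outer_lr * gi for wi, gi in zip(w0, g)]
    return w0
```

The only meta-learned quantity here is the starting weight vector w0; each new user gets a private copy that is updated by `adapt` on that user's few labeled pairs.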
If this is right
- Personalized reward models can be produced with far fewer user-specific labels than current per-user fitting requires.
- Adaptation performance becomes more stable across users whose preferences deviate from the majority.
- The same meta-initialization supports rapid switching between different base reward architectures without retraining from scratch.
- Overall training cost for maintaining a fleet of personalized models decreases because most computation occurs once in the meta phase.
Where Pith is reading between the lines
- The same meta-initialization trick could be applied to other alignment modules such as safety classifiers or response-style adapters.
- Production systems could maintain a single meta-model and spin up per-user versions on demand with minimal storage overhead.
- Future work might combine MRM with online user modeling so that the base functions themselves evolve as new preference patterns emerge across the population.
Load-bearing premise
User preferences can be expressed well enough as a weighted sum of a fixed collection of base reward functions whose starting weights can be meta-learned to work for future unseen users.
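Written out, the premise might look as follows. The symbols are assumptions introduced here for illustration (K base rewards r_i, per-user weights w_{u,i}, prompt x, preferred and rejected responses y^+ and y^-), and the Bradley-Terry likelihood is a common choice for preference data rather than something this summary confirms about the paper.

```latex
r_u(x, y) = \sum_{i=1}^{K} w_{u,i}\, r_i(x, y),
\qquad
P_u\!\left(y^{+} \succ y^{-} \mid x\right)
  = \sigma\!\big(r_u(x, y^{+}) - r_u(x, y^{-})\big)
```

Under this form, any preference that depends on interactions between base rewards, such as a product r_i r_j, lies outside what the weights alone can express.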
What would settle it
A controlled test in which, for held-out users, a reward model trained from the meta-learned initialization with k feedback examples performs no better than an identical model trained from random weights with the same k examples.
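The protocol for such a test could be sketched as below. This is a hypothetical harness, not the paper's evaluation code: it assumes each held-out user is a list of preference pairs encoded as base-reward score differences (chosen minus rejected), uses a logistic loss for adaptation, and draws the random initialization from a standard normal. Every user needs more than k pairs so that the query set is non-empty.

```python
import math
import random

def margin(w, d):
    return sum(wi * di for wi, di in zip(w, d))

def adapt(w_init, support, lr=0.5, steps=5):
    """A few gradient steps on the logistic preference loss."""
    w = list(w_init)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for d in support:
            p = 1.0 / (1.0 + math.exp(-margin(w, d)))
            for i, di in enumerate(d):
                grad[i] -= (1.0 - p) * di / len(support)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def accuracy(w, query):
    # Fraction of query pairs where the chosen response scores higher.
    return sum(margin(w, d) > 0 for d in query) / len(query)

def compare_inits(meta_init, held_out_users, k, seed=0):
    """Mean accuracy gap (meta init minus random init) over held-out users."""
    rng = random.Random(seed)
    gaps = []
    for pairs in held_out_users:
        support, query = pairs[:k], pairs[k:]
        random_init = [rng.gauss(0.0, 1.0) for _ in meta_init]
        acc_meta = accuracy(adapt(meta_init, support), query)
        acc_rand = accuracy(adapt(random_init, support), query)
        gaps.append(acc_meta - acc_rand)
    return sum(gaps) / len(gaps)
```

A gap that stays near zero across users and values of k would be the outcome that undermines the claim; both arms use the same support examples, optimizer, and step count so that the initialization is the only difference.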
Original abstract
Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines. We release code at https://github.com/ModalityDance/MRM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Meta Reward Modeling (MRM) to address data scarcity and adaptation challenges in personalized LLM alignment. It models each user's reward function as a linear combination of base reward functions and uses a MAML-style outer loop to meta-optimize the weight initializations for fast few-shot adaptation to unseen users. A Robust Personalization Objective (RPO) is introduced to emphasize hard-to-learn users during meta-training. Experiments on personalized preference datasets are reported to show that MRM improves few-shot personalization performance and user robustness while outperforming baselines.
Significance. If the empirical claims hold after detailed verification, the work could meaningfully advance scalable personalized alignment by shifting focus from per-user fitting to learning the adaptation process itself. The meta-learning framing and code release support reproducibility and potential follow-up. However, significance hinges on whether the linear-span assumption for user preferences is sufficiently expressive in practice and whether reported gains survive stronger controls for base-function construction and statistical robustness.
major comments (3)
- [Method] Method section (MRM formulation): the central modeling choice r_u = sum w_i * r_base_i assumes user preferences lie (approximately) in the linear span of the chosen bases. No analysis, construction details for the bases, or empirical test is provided showing that this span is rich enough to cover non-linear or feature-interacting preferences; this assumption is load-bearing for the few-shot adaptation claim.
- [Experiments] Experiments section: the abstract states consistent outperformance and improved robustness, yet no details appear on base-function construction, exact metrics (e.g., reward accuracy vs. win rate), number of adaptation shots, variance across runs, or statistical significance tests. Without these, it is impossible to judge whether gains are robust or attributable to MRM rather than hyper-parameter choices.
- [Method] RPO description: the Robust Personalization Objective is presented as improving robustness to hard users, but no ablation isolating RPO from the base MAML procedure is reported. This leaves unclear whether the robustness gains are due to RPO or simply the meta-learning framework.
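To make the ablation question in the last comment concrete, one way an objective can emphasize hard users is to weight per-user losses by a softmax over those losses. The sketch below is a hypothetical stand-in, not the paper's RPO, whose exact form is not given here; the temperature parameter and function names are assumptions.

```python
import math

def hardness_weights(user_losses, temperature=1.0):
    """Softmax over per-user losses: a higher loss gets a larger weight."""
    top = max(user_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - top) / temperature) for loss in user_losses]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_meta_loss(user_losses, temperature=1.0):
    """Hardness-weighted average of per-user query losses."""
    weights = hardness_weights(user_losses, temperature)
    return sum(w * loss for w, loss in zip(weights, user_losses))
```

With a very large temperature the weights become nearly uniform and the objective reduces to the plain average, which is the standard MAML-style baseline the referee asks to compare against.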
minor comments (2)
- [Abstract] Abstract: the number of base reward functions and the concrete datasets used should be stated explicitly to give readers an immediate sense of experimental scope.
- [Method] Notation: ensure the distinction between meta-training and meta-testing users is introduced with consistent symbols before the first equations.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, provide missing details, and strengthen the empirical support.
- Referee: [Method] Method section (MRM formulation): the central modeling choice r_u = sum w_i * r_base_i assumes user preferences lie (approximately) in the linear span of the chosen bases. No analysis, construction details for the bases, or empirical test is provided showing that this span is rich enough to cover non-linear or feature-interacting preferences; this assumption is load-bearing for the few-shot adaptation claim.
  Authors: We appreciate the referee highlighting the importance of the linear-span assumption. The formulation uses a linear combination to enable tractable meta-optimization of the weight initializations via MAML. In the manuscript the base functions are constructed as a diverse collection of reward heads trained on clustered subsets of public preference data to capture varied preference dimensions. To directly address the concern we will add a dedicated subsection in Methods that (i) details the base-construction procedure, (ii) provides a brief theoretical motivation for why the linear span can approximate a wide class of preference functions when the bases are sufficiently diverse, and (iii) reports an auxiliary experiment on synthetic non-linear preference data demonstrating that the span covers the majority of test cases with low reconstruction error. These additions will make the modeling choice and its practical scope explicit.
  Revision: yes
- Referee: [Experiments] Experiments section: the abstract states consistent outperformance and improved robustness, yet no details appear on base-function construction, exact metrics (e.g., reward accuracy vs. win rate), number of adaptation shots, variance across runs, or statistical significance tests. Without these, it is impossible to judge whether gains are robust or attributable to MRM rather than hyper-parameter choices.
  Authors: We agree that the current Experiments section lacks sufficient implementation and statistical detail. We will expand it to explicitly describe base-function construction (the same procedure referenced above), the primary evaluation metric (win rate on held-out user preferences, with reward-model accuracy reported as a secondary metric), the adaptation-shot regimes (1-shot, 5-shot, and 10-shot), performance variance (standard deviation across five independent random seeds), and statistical significance (paired t-tests against each baseline with p-values). These revisions will allow readers to verify that the reported improvements are attributable to MRM rather than hyper-parameter tuning.
  Revision: yes
- Referee: [Method] RPO description: the Robust Personalization Objective is presented as improving robustness to hard users, but no ablation isolating RPO from the base MAML procedure is reported. This leaves unclear whether the robustness gains are due to RPO or simply the meta-learning framework.
  Authors: We acknowledge that an ablation isolating RPO is necessary. We will add a new ablation subsection in Experiments that compares the full MRM objective against a standard MAML baseline without the robust weighting term. The comparison will report both average and worst-case personalization performance across users, thereby quantifying the incremental benefit of RPO for robustness.
  Revision: yes
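The reporting promised in these responses (variance across seeds, paired t-tests, and average versus worst-case user performance) can be computed with the Python standard library alone. The sketch below is illustrative; the function names are hypothetical and any numbers passed in would be the experimenter's own, since no results are given here.

```python
import math
import statistics

def paired_t_statistic(method_scores, baseline_scores):
    """t statistic for paired samples, e.g. one score per shared seed.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [a - b for a, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

def summarize_users(per_user_accuracy):
    """Average and worst-case personalization performance across users."""
    return {
        "average": statistics.mean(per_user_accuracy),
        "worst_case": min(per_user_accuracy),
    }
```

With five seeds the statistic has four degrees of freedom, so an absolute value above roughly 2.776 corresponds to p < 0.05 in a two-sided test. Pairing by seed matters: it removes seed-to-seed variation that both methods share.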
Circularity Check
No significant circularity: meta-learning procedure is independent of fitted inputs
The derivation presents MRM as a distinct MAML-style outer-loop optimization over initializations of weights in a linear combination of base reward functions. This is a new training procedure for fast adaptation rather than a re-expression or renaming of any fitted parameter or input data. No self-citations are invoked as load-bearing uniqueness theorems, no predictions reduce by construction to the meta-training set, and the central claims rest on empirical outperformance on held-out users. The formulation is self-contained against external benchmarks and does not collapse to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- meta-learning hyperparameters
axioms (1)
- domain assumption: User preferences can be represented as weighted combinations of shared base reward functions