arxiv: 2510.17881 · v3 · submitted 2025-10-17 · 💻 cs.CL · cs.AI

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Qingyu Yin, Priyanka Nigam, Meng Jiang, Bing Yin

Pith reviewed 2026-05-18 05:37 UTC · model grok-4.3

keywords LLM personalization · preference inference · natural language summaries · preference optimization · reinforcement learning · context reduction · user modeling

The pith

POPI learns to infer short natural-language summaries of user preferences that improve personalization across different LLMs while cutting context length by up to ten times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces POPI to address how large language models are usually tuned to average user preferences rather than individual ones. It splits the task into an inference step that turns varied user data into a compact natural-language preference summary and a generation step that produces responses based on that summary. Both steps train together under one preference-optimization objective, with reinforcement learning managing the non-differentiable inference part. The natural-language format lets a summary be created once per user and then applied to many different models, including fixed commercial systems. Results on four benchmarks show higher personalization quality alongside much lower context requirements.
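The two-step split described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `UserSignals`, `infer_summary`, `personalize`, the prompt template, and the toy models are all hypothetical names introduced here, and real models would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any callable that maps a prompt string to text.
TextModel = Callable[[str], str]


@dataclass
class UserSignals:
    persona: str
    preference_pairs: List[Tuple[str, str, str]]  # (prompt, chosen, rejected)


def render_signals(signals: UserSignals) -> str:
    """Flatten heterogeneous user signals into one text block."""
    lines = [f"Persona: {signals.persona}"]
    for prompt, chosen, rejected in signals.preference_pairs:
        lines.append(f"Prompt: {prompt} | Preferred: {chosen} | Rejected: {rejected}")
    return "\n".join(lines)


def infer_summary(inference_model: TextModel, signals: UserSignals) -> str:
    """Inference step: run once per user to get a short preference summary."""
    return inference_model(render_signals(signals))


def personalize(generator: TextModel, summary: str, prompt: str) -> str:
    """Generation step: condition any generator on the summary (assumed template)."""
    return generator(f"User preferences: {summary}\nRequest: {prompt}")


# Toy stand-ins for trained or commercial models (assumptions for illustration).
def toy_inference(raw_signals: str) -> str:
    return "Prefers short, plain-language answers."


def toy_generator_a(prompt: str) -> str:
    return "[A] " + prompt.splitlines()[0]


def toy_generator_b(prompt: str) -> str:
    return "[B] " + prompt.splitlines()[0]


signals = UserSignals(
    persona="high-school student",
    preference_pairs=[("Explain gravity", "Simple answer", "Dense answer")],
)
summary = infer_summary(toy_inference, signals)  # computed once
outputs = {
    name: personalize(gen, summary, "Explain entropy")
    for name, gen in {"a": toy_generator_a, "b": toy_generator_b}.items()
}
```

The point of the design is that only `summary`, not the raw signals, crosses the boundary to each generator, which is what would make the summary reusable with a model that cannot be fine-tuned.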

Core claim

POPI demonstrates that a single preference-optimization objective can jointly train an inference model to distill user signals into concise natural-language summaries and a generator to condition on those summaries, with the objective splitting into generator approximation error and summary informativeness. This decomposition supports accurate personalized outputs while ensuring the summaries carry useful information, and the language interface makes the summaries portable across generators including black-box APIs.
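For orientation, one plausible form of such a joint objective is the standard DPO loss with the generator conditioned on a sampled summary. This is an assumption made for illustration, not the paper's stated equation: the symbols $q_\phi$ (inference model), $\pi_\theta$ (generator), $\pi_{\mathrm{ref}}$ (reference policy), $u$ (user signals), $s$ (summary), and the choice to condition the reference policy on $s$ are all introduced here.

```latex
\mathcal{L}(\theta,\phi)
  = -\,\mathbb{E}_{(u,\,x,\,y_w,\,y_l)}\;
     \mathbb{E}_{s \sim q_\phi(\cdot \mid u)}
     \left[
       \log \sigma\!\left(
         \beta \log \frac{\pi_\theta(y_w \mid x, s)}{\pi_{\mathrm{ref}}(y_w \mid x, s)}
         - \beta \log \frac{\pi_\theta(y_l \mid x, s)}{\pi_{\mathrm{ref}}(y_l \mid x, s)}
       \right)
     \right]
```

Under this reading, the gradient with respect to $\theta$ is ordinary DPO, while the dependence on $\phi$ runs through the discrete sample $s$, which is why a reinforcement-learning estimator would be needed for the inference model. The decomposition into approximation error and informativeness is not reproduced here because the page does not show it.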

What carries the argument

The natural-language preference summary that serves as the reusable interface between a shared inference model and a shared generator under one unified optimization objective.

If this is right

Summaries inferred from user signals can be generated once and then reused across multiple different generators.
Personalization quality rises on four benchmarks while context overhead falls by up to an order of magnitude.
The method applies directly to black-box commercial LLMs without any fine-tuning of those models.
The loss decomposition simultaneously improves both summary informativeness and generation accuracy.
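The context-overhead claim is simple arithmetic, which the sketch below makes concrete. Whitespace token counts are a crude proxy for a real tokenizer, and both strings are invented placeholders, not data from the paper.

```python
def approx_tokens(text: str) -> int:
    """Rough token count: whitespace-separated pieces (a proxy, not a real tokenizer)."""
    return len(text.split())


def context_reduction(raw_context: str, summary: str) -> float:
    """How many times shorter the summary is than the raw user context."""
    return approx_tokens(raw_context) / max(approx_tokens(summary), 1)


# Placeholder inputs: 20 repeated preference records versus one short summary.
raw_context = " ".join(
    ["Prompt: explain X. Preferred: short answer. Rejected: long answer."] * 20
)
summary = "Prefers short, direct answers without jargon."

ratio = context_reduction(raw_context, summary)
print(f"raw={approx_tokens(raw_context)} summary={approx_tokens(summary)} ratio={ratio:.1f}x")
```

An "order of magnitude" reduction corresponds to a ratio of about 10 or more; the ratio grows with the length of the user history because the summary length stays roughly fixed.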

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A single portable preference summary could serve a user across many separate AI services without repeated full-context processing.
Inference-time efficiency could increase because only the short summary needs to be included rather than raw user history.
The separation of inference and generation might be tested on whether summaries stay effective when moved to generators with very different architectures.

Load-bearing premise

Concise natural-language preference summaries inferred once per user can be reused across different generators, including frozen commercial APIs, without substantial loss of personalization effectiveness.

What would settle it

A test on a new generator where feeding the inferred summaries produces no measurable gain in personalization quality over using only population-level defaults or the full original context.
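Such a test could be scored as below: a paired bootstrap over prompts on per-prompt win indicators, comparing the summary condition against a baseline condition. The function names, the 95% interval rule, and the win/loss vectors are assumptions for illustration; the numbers are synthetic, not results.

```python
import random
from statistics import mean


def bootstrap_gain(summary_wins, baseline_wins, n_boot=2000, seed=0):
    """Paired bootstrap over prompts. Returns (mean gain, (low, high) 95% interval)."""
    assert len(summary_wins) == len(baseline_wins)
    rng = random.Random(seed)
    diffs = [s - b for s, b in zip(summary_wins, baseline_wins)]
    n = len(diffs)
    boots = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    low = boots[int(0.025 * n_boot)]
    high = boots[int(0.975 * n_boot) - 1]
    return mean(diffs), (low, high)


def no_measurable_gain(summary_wins, baseline_wins) -> bool:
    """True when the interval for the gain does not lie strictly above zero."""
    _, (low, _high) = bootstrap_gain(summary_wins, baseline_wins)
    return low <= 0.0


# Synthetic per-prompt outcomes on a held-out generator (1 = judged personalized).
summary_wins = [1] * 70 + [0] * 30
default_wins = [1] * 40 + [0] * 60  # population-level default prompt

gain, interval = bootstrap_gain(summary_wins, default_wins)
print(gain, interval, no_measurable_gain(summary_wins, default_wins))
```

The same comparison would be run a second time against the full-original-context condition, since the premise fails if the summary shows no gain over either baseline.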

Figures

Figures reproduced from arXiv: 2510.17881 by Bing Yin, Changlong Yu, Meng Jiang, Pei Chen, Priyanka Nigam, Qingyu Yin, Ruijie Wang, Xin Liu, Yizhuo Chen, Zheng Li.

**Figure 1.** Overview of POPI usage. Heterogeneous user signals (such as textual personas, few-shot preference pairs, and ...

**Figure 2.** Training framework of POPI. Given raw user signals, the preference inference LLM generates a natural language ...

**Figure 3.** The template used when querying GPT-4o to serve ...

**Figure 4.** Left: templates for the preference inference LLM, which transforms raw user signals (or user signals plus prompt) ...

**Figure 5.** Qualitative case study on the ELIX dataset. Each row corresponds to one of the five ground-truth personas in ELIX, ...

Original abstract

Large language models (LLMs) are typically aligned with population-level preferences, despite substantial variation across individual users. We introduce POPI, a user-level personalization framework that separates the problem into two components connected by a natural-language interface: a shared inference model that distills heterogeneous user signals into a concise preference summary, and a shared generator that conditions on this summary to produce personalized responses. Both components are trained under a unified preference-optimization objective, with reinforcement learning handling the non-differentiable inference step. This objective decomposes into generator approximation error and summary informativeness, revealing how a single loss simultaneously drives accurate generation and informative summarization. Because the interface is natural language, learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs. Across four personalization benchmarks, POPI generally improves personalization quality while reducing context overhead by up to an order of magnitude.
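One way to picture reinforcement learning handling the non-differentiable inference step: sample several candidate summaries per user, score each by how well the generator separates preferred from rejected responses when conditioned on it, and normalize the scores within the group. The abstract does not name the RL algorithm, so the group-relative normalization here is an assumption, as are the function names, the `beta` value, and the margin numbers.

```python
import math
from statistics import mean, pstdev


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def summary_reward(margins, beta=0.1):
    """Reward for one sampled summary.

    Each margin is the generator's log-ratio gap between the chosen and rejected
    response (relative to a reference model) when conditioned on this summary.
    """
    return mean(log_sigmoid(beta * m) for m in margins)


def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within the group of summaries sampled for one user."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# Hypothetical margins for three sampled summaries of the same user.
margins_per_summary = [[2.0, 1.5], [0.1, -0.2], [-1.0, -0.5]]
rewards = [summary_reward(m) for m in margins_per_summary]
advantages = group_advantages(rewards)
print(advantages)
```

Summaries with positive advantage would have their token log-probabilities pushed up in a policy-gradient update of the inference model, while the generator itself can still be trained by ordinary gradient descent on the preference loss.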

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

POPI frames personalization as learning reusable NL preference summaries via a joint RL preference objective, but the cross-generator transfer claim rests on an untested assumption.

The paper's central move is to split personalization into an inference step that turns user signals into a short natural-language preference summary and a generation step that conditions on that summary. Both pieces are trained together under one preference-optimization loss, with RL handling the non-differentiable inference. The natural-language interface is the selling point: once you have the summary for a user, you can supposedly feed it to any generator, including frozen commercial APIs, without retraining or long context histories.

That framing is cleaner than most prior personalization work that either fine-tunes the whole model or stuffs everything into the prompt each time. If the four benchmarks actually show quality gains plus an order-of-magnitude drop in context tokens, the engineering payoff is real and worth noting. The decomposition of the objective into generator approximation error plus summary informativeness also gives a useful way to diagnose where the gains are coming from. Credit for shipping a concrete method that tries to make the summaries portable rather than model-specific.

The main soft spot is exactly the one the stress-test note flags. Training is joint, so the summaries are optimized for the particular generator they see during RL. Nothing in the abstract or the described setup shows a direct test of feeding those same summaries into an unrelated black-box model and measuring whether personalization quality holds up. Without that experiment, the reuse claim stays more aspirational than demonstrated. The soundness numbers in the reader's report also line up with what is visible: gains are asserted but the quantitative tables, baselines, and variance numbers are not in the abstract, so it is hard to judge effect sizes or stability of the RL step.

This is the kind of paper that belongs in a reading group for people working on practical LLM alignment and efficiency. It is not proposing a new theoretical paradigm, but it packages existing preference techniques into a reusable interface that could matter for deployment. I would send it to peer review. The core idea is coherent, the problem is relevant, and the missing transfer test is fixable with additional experiments rather than a fatal flaw in the framing.

Referee Report

2 major / 2 minor

Summary. The paper introduces POPI, a user-level personalization framework that separates the problem into a shared inference model distilling heterogeneous user signals into concise natural-language preference summaries and a shared generator that conditions on these summaries. Both are trained under a unified preference-optimization objective that decomposes into generator approximation error and summary informativeness, with reinforcement learning handling the non-differentiable inference step. The natural-language interface is claimed to allow summaries inferred once per user to be reused across different generators, including frozen black-box commercial APIs. Experiments across four personalization benchmarks report general improvements in personalization quality alongside context overhead reductions of up to an order of magnitude.

Significance. If the transferability of the inferred summaries to unseen generators holds and the unified objective is shown to be non-circular, the framework could enable practical, generator-agnostic personalization that substantially reduces context length requirements for individual users across both open and proprietary LLMs.

major comments (2)

[Abstract] The central claim that 'learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs' lacks direct empirical support. The reported benchmarks evaluate jointly trained inference-generator pairs; no transfer experiments to unseen or black-box generators are described, leaving the generator-agnostic reuse assumption untested and load-bearing for the broader applicability asserted.
[Abstract] The decomposition of the unified objective into generator approximation error and summary informativeness is asserted without the full derivation or equations. It remains unclear whether the informativeness measure is independently grounded or reduces to a quantity fitted from the same preference data used to train the generator, raising a potential circularity risk for the claimed separation of concerns.

minor comments (2)

The abstract states empirical gains and context reductions but provides no quantitative results, specific baselines, or error analysis; including key numbers and a brief comparison table in the abstract or introduction would improve accessibility.
Details on the stability and variance of the RL step for the non-differentiable inference are asserted but not elaborated; a short discussion or ablation on training dynamics would strengthen the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Abstract] The central claim that 'learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs' lacks direct empirical support. The reported benchmarks evaluate jointly trained inference-generator pairs; no transfer experiments to unseen or black-box generators are described, leaving the generator-agnostic reuse assumption untested and load-bearing for the broader applicability asserted.

Authors: We agree that the current experiments evaluate the jointly trained inference-generator system and do not include explicit transfer tests to unseen or black-box generators. The natural-language interface is explicitly designed to support generator-agnostic reuse, as summaries are produced independently of any generator parameters. To provide direct empirical support for the claim, we will add transfer experiments applying the inferred summaries to a frozen commercial API in the revised manuscript.

Revision: yes

Referee: [Abstract] The decomposition of the unified objective into generator approximation error and summary informativeness is asserted without the full derivation or equations. It remains unclear whether the informativeness measure is independently grounded or reduces to a quantity fitted from the same preference data used to train the generator, raising a potential circularity risk for the claimed separation of concerns.

Authors: We acknowledge that the abstract asserts the decomposition without presenting the equations. The motivation for the decomposition is discussed in the methods, but we agree that a full derivation would clarify the separation. In the revision we will add the explicit equations showing the unified objective as the sum of generator approximation error (under the true user preference) and an informativeness term defined via mutual information between the summary and user signals. This term is computed from the preference data independently of generator parameters, avoiding circularity; we will expand this explanation in the revised text.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a unified preference-optimization objective that decomposes into generator approximation error and summary informativeness, with the natural-language interface enabling reuse across generators. No equations or definitions are shown that make the informativeness measure equivalent to a fitted parameter or input data by construction, nor does any load-bearing step reduce to self-citation or renaming. The joint training and benchmark evaluations provide independent empirical content for the claims of quality improvement and context reduction, rendering the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that natural language is a sufficiently lossless medium for preference summaries and that the RL-augmented objective simultaneously optimizes both summary quality and generation accuracy; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Natural-language summaries can capture heterogeneous user signals sufficiently for reuse across generators.
Invoked when claiming summaries can be inferred once and applied to black-box APIs.

pith-pipeline@v0.9.0 · 5710 in / 1219 out tokens · 26988 ms · 2026-05-18T05:37:02.436175+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "unified preference-optimization objective... decomposes into generator approximation error and summary informativeness"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "natural language interface... reused across different generators including frozen black-box commercial APIs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
