pith. machine review for the scientific record.

arxiv: 2510.17881 · v3 · submitted 2025-10-17 · 💻 cs.CL · cs.AI

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

Pith reviewed 2026-05-18 05:37 UTC · model grok-4.3

keywords: LLM personalization · preference inference · natural language summaries · preference optimization · reinforcement learning · context reduction · user modeling

The pith

POPI learns to infer short natural-language summaries of user preferences that improve personalization across different LLMs while cutting context length by up to a factor of ten.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces POPI to address how large language models are usually tuned to average user preferences rather than individual ones. It splits the task into an inference step that turns varied user data into a compact natural-language preference summary and a generation step that produces responses based on that summary. Both steps train together under one preference-optimization objective, with reinforcement learning managing the non-differentiable inference part. The natural-language format lets a summary be created once per user and then applied to many different models, including fixed commercial systems. Results on four benchmarks show higher personalization quality alongside much lower context requirements.

Core claim

POPI demonstrates that a single preference-optimization objective can jointly train an inference model to distill user signals into concise natural-language summaries and a generator to condition on those summaries, with the objective splitting into generator approximation error and summary informativeness. This decomposition supports accurate personalized outputs while ensuring the summaries carry useful information, and the language interface makes the summaries portable across generators including black-box APIs.
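
One way to see how a single loss can split into these two terms is the standard log-loss identity below. It is an illustrative sketch under assumed notation, not the paper's own derivation: x is the prompt, y the preferred response, z the inferred summary, p the true user-conditional distribution, and q_θ the generator. The paper's result concerns a summary-augmented DPO objective whose exact form is not reproduced on this page.

```latex
% Cross-entropy of the generator, conditioned on prompt x and summary z
\mathbb{E}_{p(x,y,z)}\bigl[-\log q_\theta(y \mid x, z)\bigr]
  = \underbrace{\mathbb{E}_{p(x,z)}\Bigl[\mathrm{KL}\bigl(p(y \mid x, z)\,\|\,q_\theta(y \mid x, z)\bigr)\Bigr]}_{\text{generator approximation error}}
  + H(Y \mid X, Z)

% The residual entropy shrinks as the summary becomes more informative
H(Y \mid X, Z) = H(Y \mid X) - \underbrace{I(Y; Z \mid X)}_{\text{summary informativeness}}
```

Since H(Y | X) does not depend on either model, minimizing the left-hand side both reduces the generator's approximation error and rewards summaries that carry more information about the user's preferred response.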

What carries the argument

The natural-language preference summary that serves as the reusable interface between a shared inference model and a shared generator under one unified optimization objective.
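
The interface idea can be sketched in a few lines: infer the summary once, cache it, and pass the same text to any generator. Everything named here (`UserProfile`, `infer_summary`, `personalize`, the prompt wording, and the toy models) is hypothetical and stands in for components this page does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical alias: any callable that maps a prompt string to text.
Generator = Callable[[str], str]


@dataclass
class UserProfile:
    user_id: str
    summary: str  # concise natural-language preference summary


def infer_summary(user_id: str, signals: List[str],
                  inference_model: Callable[[str], str]) -> UserProfile:
    """Run the (assumed) inference model once over the raw user signals."""
    prompt = "Summarize this user's preferences:\n" + "\n".join(signals)
    return UserProfile(user_id=user_id, summary=inference_model(prompt))


def personalize(profile: UserProfile, query: str, generator: Generator) -> str:
    """Condition any generator on the cached summary instead of raw signals."""
    prompt = f"User preferences: {profile.summary}\n\nRequest: {query}"
    return generator(prompt)


# Stand-in models so the sketch runs without any external service.
def toy_inference_model(prompt: str) -> str:
    return "Prefers short answers with concrete examples."


def toy_generator_a(prompt: str) -> str:
    return f"[A] {len(prompt)} chars of context received"


def toy_generator_b(prompt: str) -> str:
    return f"[B] {len(prompt)} chars of context received"


# Made-up user signals, repeated to mimic a long interaction history.
signals = ["Liked: a two-line answer with a code sample.",
           "Disliked: a long theoretical essay."] * 20
profile = infer_summary("u1", signals, toy_inference_model)  # inferred once

# The same cached summary is reused across two different generators.
outputs: Dict[str, str] = {
    name: personalize(profile, "Explain recursion.", gen)
    for name, gen in {"a": toy_generator_a, "b": toy_generator_b}.items()
}
raw_context_chars = len("\n".join(signals))
summary_chars = len(profile.summary)
```

Because the generators only ever see a string, a frozen API client could be dropped in wherever `toy_generator_a` appears. Comparing `summary_chars` with `raw_context_chars` gives a rough sense of the context saved, although the signals here are invented for illustration.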

If this is right

  • Summaries inferred from user signals can be generated once and then reused across multiple different generators.
  • Personalization quality generally rises across four benchmarks while context overhead falls by up to an order of magnitude.
  • The method applies directly to black-box commercial LLMs without any fine-tuning of those models.
  • The loss decomposition simultaneously improves both summary informativeness and generation accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single portable preference summary could serve a user across many separate AI services without repeated full-context processing.
  • Inference-time efficiency could increase because only the short summary needs to be included rather than raw user history.
  • The separation of inference and generation might be tested on whether summaries stay effective when moved to generators with very different architectures.

Load-bearing premise

Concise natural-language preference summaries inferred once per user can be reused across different generators, including frozen commercial APIs, without substantial loss of personalization effectiveness.

What would settle it

A test on a new generator where feeding the inferred summaries produces no measurable gain in personalization quality over using only population-level defaults or the full original context.
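
Such a test could be scored along the following lines. This is a minimal sketch that assumes each user receives a numeric personalization score under each condition. The function names, the percentile bootstrap, and the decision rule are illustrative choices, not the paper's evaluation protocol.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_gain(summary_scores: List[float], baseline_scores: List[float],
                n_boot: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Mean per-user score gain with a percentile bootstrap 95% interval."""
    if len(summary_scores) != len(baseline_scores) or not summary_scores:
        raise ValueError("need equal-length, non-empty score lists")
    # Paired differences: same user, summary condition minus baseline condition.
    diffs = [s - b for s, b in zip(summary_scores, baseline_scores)]
    rng = random.Random(seed)
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot) - 1]
    return mean(diffs), lo, hi


def no_measurable_gain(summary_scores: List[float],
                       baseline_scores: List[float]) -> bool:
    """True when the interval for the gain includes zero or lies below it."""
    _, lo, _ = paired_gain(summary_scores, baseline_scores)
    return lo <= 0.0
```

Running `no_measurable_gain` once against population-level defaults and once against the full original context, on a generator that was not part of training, would cover both comparisons named above.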

Figures

Figures reproduced from arXiv: 2510.17881 by Bing Yin, Changlong Yu, Meng Jiang, Pei Chen, Priyanka Nigam, Qingyu Yin, Ruijie Wang, Xin Liu, Yizhuo Chen, Zheng Li.

Figure 1: Overview of POPI usage. Heterogeneous user signals (such as textual personas, few-shot preference pairs, and ...
Figure 2: Training framework of POPI. Given raw user signals, the preference inference LLM generates a natural language ...
Figure 3: The template used when querying GPT-4o to serve ...
Figure 4: Left: templates for the preference inference LLM, which transforms raw user signals (or user signals plus prompt) ...
Figure 5: Qualitative case study on the ELIX dataset. Each row corresponds to one of the five ground-truth personas in ELIX, ...
Original abstract

Large language models (LLMs) are typically aligned with population-level preferences, despite substantial variation across individual users. We introduce POPI, a user-level personalization framework that separates the problem into two components connected by a natural-language interface: a shared inference model that distills heterogeneous user signals into a concise preference summary, and a shared generator that conditions on this summary to produce personalized responses. Both components are trained under a unified preference-optimization objective, with reinforcement learning handling the non-differentiable inference step. This objective decomposes into generator approximation error and summary informativeness, revealing how a single loss simultaneously drives accurate generation and informative summarization. Because the interface is natural language, learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs. Across four personalization benchmarks, POPI generally improves personalization quality while reducing context overhead by up to an order of magnitude.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces POPI, a user-level personalization framework that separates the problem into a shared inference model distilling heterogeneous user signals into concise natural-language preference summaries and a shared generator that conditions on these summaries. Both are trained under a unified preference-optimization objective that decomposes into generator approximation error and summary informativeness, with reinforcement learning handling the non-differentiable inference step. The natural-language interface is claimed to allow summaries inferred once per user to be reused across different generators, including frozen black-box commercial APIs. Experiments across four personalization benchmarks report general improvements in personalization quality alongside context overhead reductions of up to an order of magnitude.

Significance. If the transferability of the inferred summaries to unseen generators holds and the unified objective is shown to be non-circular, the framework could enable practical, generator-agnostic personalization that substantially reduces context length requirements for individual users across both open and proprietary LLMs.

major comments (2)
  1. [Abstract] The central claim that 'learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs' lacks direct empirical support. The reported benchmarks evaluate jointly trained inference-generator pairs, and no transfer experiments to unseen or black-box generators are described, so the generator-agnostic reuse assumption, which is load-bearing for the broader applicability asserted, remains untested.
  2. [Abstract] The decomposition of the unified objective into generator approximation error and summary informativeness is asserted without the full derivation or equations. It remains unclear whether the informativeness measure is independently grounded or reduces to a quantity fitted from the same preference data used to train the generator, raising a potential circularity risk for the claimed separation of concerns.
minor comments (2)
  1. The abstract states empirical gains and context reductions but provides no quantitative results, specific baselines, or error analysis; including key numbers and a brief comparison table in the abstract or introduction would improve accessibility.
  2. Details on the stability and variance of the RL step for the non-differentiable inference are asserted but not elaborated; a short discussion or ablation on training dynamics would strengthen the method section.
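
For readers who want a concrete picture of where variance enters, the sketch below shows one common way to turn rewards for sampled summaries into a learning signal: group-normalized advantages in the style of GRPO. This page does not say which RL algorithm POPI uses, so the choice of estimator, the function name, and the `eps` constant are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of sampled summaries.

    Assumption: each reward scores one summary sampled for the same user,
    for example by how strongly the generator, conditioned on that summary,
    prefers the user's chosen response over the rejected one.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sampled summary earns the same reward, all advantages are zero and the group contributes no gradient, which is one reason an ablation on group size and reward spread would be informative.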

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'learned summaries can be inferred once per user and reused across different generators -- including frozen, black-box commercial APIs' lacks direct empirical support. The reported benchmarks evaluate jointly trained inference-generator pairs, and no transfer experiments to unseen or black-box generators are described, so the generator-agnostic reuse assumption, which is load-bearing for the broader applicability asserted, remains untested.

    Authors: We agree that the current experiments evaluate the jointly trained inference-generator system and do not include explicit transfer tests to unseen or black-box generators. The natural-language interface is explicitly designed to support generator-agnostic reuse, as summaries are produced independently of any generator parameters. To provide direct empirical support for the claim, we will add transfer experiments applying the inferred summaries to a frozen commercial API in the revised manuscript. revision: yes

  2. Referee: [Abstract] The decomposition of the unified objective into generator approximation error and summary informativeness is asserted without the full derivation or equations. It remains unclear whether the informativeness measure is independently grounded or reduces to a quantity fitted from the same preference data used to train the generator, raising a potential circularity risk for the claimed separation of concerns.

    Authors: We acknowledge that the abstract asserts the decomposition without presenting the equations. The motivation for the decomposition is discussed in the methods, but we agree that a full derivation would clarify the separation. In the revision we will add the explicit equations showing the unified objective as the sum of generator approximation error (under the true user preference) and an informativeness term defined via mutual information between the summary and user signals. This term is computed from the preference data independently of generator parameters, avoiding circularity; we will expand this explanation in the revised text. revision: yes
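
As a point of reference for the objective under discussion, the standard DPO loss (Rafailov et al., 2023) for a single preference pair can be written as below. The only change assumed here is that every log-probability is computed with the summary appended to the prompt. The paper's exact summary-augmented objective is not shown on this page, so this is a hedged sketch and the function name and `beta` default are illustrative.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probs.

    Assumption: all four log-probs are conditioned on prompt plus summary.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed as softplus(-margin), branch-stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree on both responses, and it falls as the policy shifts probability toward the chosen response relative to the reference.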

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a unified preference-optimization objective that decomposes into generator approximation error and summary informativeness, with the natural-language interface enabling reuse across generators. No equations or definitions are shown that make the informativeness measure equivalent to a fitted parameter or input data by construction, nor does any load-bearing step reduce to self-citation or renaming. The joint training and benchmark evaluations provide independent empirical content for the claims of quality improvement and context reduction, rendering the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that natural language is a sufficiently lossless medium for preference summaries and that the RL-augmented objective simultaneously optimizes both summary quality and generation accuracy; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Natural-language summaries can capture heterogeneous user signals sufficiently for reuse across generators.
    Invoked when claiming summaries can be inferred once and applied to black-box APIs.

pith-pipeline@v0.9.0 · 5710 in / 1219 out tokens · 26988 ms · 2026-05-18T05:37:02.436175+00:00 · methodology



Under review as a conference paper at WWW ’26, April 13–17, 2026, Dubai, UAE (Chen et al.)

Appendix A: Derivation of Inequality 9

We derive the information-theoretic interpretation of the summary-augmented DPO object...