Preference-Aware Rubric Learning for Personalized Evaluation

Cilin Yan; Jiayin Cai; Tat-Seng Chua; Xiaolong Jiang; Xiaoyan Zhao; Yang Zhang; Yao Hu; Yilun Qiu; Yoko Yamakata; Yuxin Chen

A learning approach extracts evaluation rubrics from user interaction histories to judge how well LLM outputs match individual preferences.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-28 22:21 UTC pith:5QROTPXV

Load-bearing objection: PARL frames personalized eval as learning rubrics from histories via discriminative RL, but self-validation on the same data looks circular and the abstract gives no numbers to check the claims.

arxiv 2605.31545 v1 pith:5QROTPXV submitted 2026-05-29 cs.CL

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Yoko Yamakata, Tat-Seng Chua

classification cs.CL

keywords personalized evaluation, rubric learning, LLM alignment, preference modeling, user consistency, discriminative learning, text generation evaluation

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard automatic metrics and LLM judges cannot handle the subjective preferences users reveal across long interaction histories. It reframes personalized evaluation as a learning problem that induces rubrics meeting three principles: they must represent the user's standards, stay consistent with past choices, and discriminate aligned responses from others. The method learns these rubrics by contrasting user-written responses against model outputs through a reinforcement learning objective and applies an internal self-validation step. If the claim holds, evaluation can proceed directly from raw histories without fresh human labels for each new output.

Core claim

The paper establishes that preference-aware rubrics can be induced directly from raw user histories by combining rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, together with a self-validation mechanism that enforces consistency with the user's demonstrated preferences.

What carries the argument

Rubric induction paired with a discriminative reinforcement learning objective that learns user-specific decision boundaries from history data.
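
One way to picture the discriminative contrast described above is as a margin between the score a rubric gives the user's own response and the best score it gives any competing model output. The sketch below is illustrative only and is not the paper's objective: `margin_reward`, `keyword_score`, and the keyword-style rubric are hypothetical stand-ins, and a real setup would presumably use an LLM judge as the scorer.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: maps (rubric, response) to a score in [0, 1].
Scorer = Callable[[Sequence[str], str], float]


def keyword_score(rubric: Sequence[str], response: str) -> float:
    """Toy stand-in for a judge: fraction of rubric keywords found in the response."""
    if not rubric:
        return 0.0
    text = response.lower()
    hits = sum(1 for criterion in rubric if criterion.lower() in text)
    return hits / len(rubric)


def margin_reward(
    rubric: Sequence[str],
    user_response: str,
    model_responses: Sequence[str],
    score: Scorer,
) -> float:
    """Positive when the user's own response outscores every competing output."""
    if not model_responses:
        raise ValueError("need at least one competing model response")
    user_score = score(rubric, user_response)
    best_competitor = max(score(rubric, r) for r in model_responses)
    return user_score - best_competitor


# Assumed toy data, not taken from the paper.
rubric = ["battery", "price"]
user_response = "Battery life is great for the price."
model_responses = ["This product is excellent.", "The battery is fine."]
reward = margin_reward(rubric, user_response, model_responses, keyword_score)
print(reward)  # 0.5
```

A rubric that earns a large positive margin separates the user's writing from strong model outputs; a margin near zero or below suggests the rubric is not discriminative for that user.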

Load-bearing premise

User interaction histories contain stable, learnable preferences that rubrics can capture and validate through internal consistency alone.

What would settle it

An experiment in which rubrics induced on one set of user histories assign lower scores to the same user's new responses than to competing model outputs on held-out interactions.
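
The check described above could be run roughly as follows. This is a minimal sketch under stated assumptions: `heldout_win_rate` and `toy_score` are hypothetical names, the scorer is a keyword-matching placeholder rather than the paper's judge, and the held-out pairs are assumed to come from interactions that were not used to induce the rubric.

```python
from typing import Callable, List, Sequence, Tuple

# Each held-out item pairs the user's own response with competing model outputs.
HeldOutItem = Tuple[str, List[str]]


def toy_score(rubric: Sequence[str], response: str) -> float:
    """Placeholder scorer: fraction of rubric keywords present in the response."""
    if not rubric:
        return 0.0
    text = response.lower()
    return sum(1 for c in rubric if c.lower() in text) / len(rubric)


def heldout_win_rate(
    heldout: Sequence[HeldOutItem],
    rubric: Sequence[str],
    score: Callable[[Sequence[str], str], float],
) -> float:
    """Fraction of held-out interactions where the user's response strictly
    outscores every competing model output under the induced rubric."""
    if not heldout:
        raise ValueError("held-out set is empty")
    wins = 0
    for user_response, model_responses in heldout:
        user_score = score(rubric, user_response)
        if all(user_score > score(rubric, r) for r in model_responses):
            wins += 1
    return wins / len(heldout)


# Assumed toy data for illustration only.
rubric = ["blunt"]
heldout = [
    ("short and blunt", ["a long and flowery reply"]),
    ("blunt again", ["blunt too", "neutral"]),
]
rate = heldout_win_rate(heldout, rubric, toy_score)
print(rate)  # 0.5
```

A persistently low win rate on held-out interactions would be the outcome that counts against the claim; ties are treated as losses here, which is a design choice rather than something the paper specifies.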

If this is right

The induced rubrics identify user-aligned responses with high fidelity on real-world text generation tasks.
Rubrics learned this way generalize across different users and across tasks.
The rubrics capture stable stylistic preferences as well as fine-grained evaluative patterns.
Self-validation during learning removes the need for external human judgment to confirm rubric quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rubrics produced this way could serve as training signals to fine-tune models toward a specific user's demonstrated standards.
The same induction process might apply to multi-turn dialogue or non-text outputs if histories of those forms are available.
Internal validation could reduce reliance on crowdsourced preference data for building evaluation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

PARL frames personalized eval as learning rubrics from histories via discriminative RL, but self-validation on the same data looks circular and the abstract gives no numbers to check the claims.

The main takeaway is that this paper treats personalized evaluation as a learning problem: it induces rubrics from raw user histories, uses a discriminative RL objective to contrast user responses against model outputs, and adds a self-validation step to check consistency. That combination is new enough to note, even if the pieces (rubrics, preference learning) exist separately.

It does a clean job stating the problem with current metrics and LLM judges, and the three principles (Representativeness, User-Consistency, Discriminativeness) give a sensible checklist. Releasing code helps anyone who wants to test the setup.

The soft spots are the lack of any quantitative results, baselines, or error analysis in the abstract, which makes it impossible to judge whether the rubrics actually work or generalize. The self-validation runs on the training histories without held-out interactions or external judgments, so the reported consistency could just be internal fitting rather than stable user preferences. The assumption that histories contain extractable, transferable preferences is doing a lot of work here.

This is for people working on user-centric LLM alignment and evaluation methods. A reader looking for new frameworks might pick up the paradigm and the RL contrast idea, but the thin evidence means it is hard to assess real impact yet.

I would send it to peer review so the full experiments and any external validation can be checked, rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 0 minor

Summary. The paper proposes PARL (Preference-Aware Rubric Learning), a framework under the 'Personalized Evaluation as Learning' paradigm. It learns evaluation rubrics directly from raw user interaction histories via rubric induction combined with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive model outputs. A self-validation step is included to enforce consistency with user preferences. The central claim is that experiments on real-world personalized text generation tasks show PARL produces high-fidelity rubrics that reliably identify user-aligned responses, generalize across users and tasks, and capture stable stylistic and evaluative patterns. Code is released for reproducibility.

Significance. If the results can be substantiated with non-circular validation, the work would meaningfully advance personalized LLM evaluation by shifting from static or generic judges to learned, user-specific rubrics grounded in interaction histories. The release of code at https://github.com/SnowCharmQ/PARL is a positive contribution to reproducibility.

major comments (2)

[Abstract] Abstract: The claim that 'Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks' is presented without any quantitative metrics, baselines, error analysis, or statistical tests. This absence makes it impossible to evaluate the strength or reliability of the reported experimental success.
[Abstract] Abstract (self-validation mechanism): The self-validation is described as ensuring consistency with the user's preferences, yet both rubric induction and validation operate on the same raw user histories without reference to held-out interactions, external human judgments, or independent benchmarks. This setup risks circularity, where reported fidelity and generalization may reflect fitting to training patterns rather than capturing stable, transferable preferences (particularly given the discriminative RL objective contrasting user responses against model outputs).
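
The second comment asks for validation on interactions the rubric never saw. A per-user chronological split is one simple way to set that up. The sketch below is an assumption about how such a split might look, not a description of the paper's protocol; the `Interaction` fields and `chronological_split` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Interaction:
    user_id: str
    timestamp: int  # any sortable notion of time
    prompt: str
    response: str


def chronological_split(
    history: List[Interaction], holdout_fraction: float = 0.2
) -> Tuple[List[Interaction], List[Interaction]]:
    """Per user, keep the earliest interactions for rubric induction and
    reserve the most recent ones for validation."""
    by_user: Dict[str, List[Interaction]] = {}
    for item in history:
        by_user.setdefault(item.user_id, []).append(item)

    induction: List[Interaction] = []
    heldout: List[Interaction] = []
    for items in by_user.values():
        items.sort(key=lambda x: x.timestamp)
        # Users with a single interaction contribute nothing to the held-out set.
        n_holdout = max(1, int(len(items) * holdout_fraction)) if len(items) > 1 else 0
        cut = len(items) - n_holdout
        induction.extend(items[:cut])
        heldout.extend(items[cut:])
    return induction, heldout


# Assumed toy history: user "a" has five interactions, user "b" has one.
history = [Interaction("a", t, f"prompt {t}", f"reply {t}") for t in (5, 3, 1, 4, 2)]
history.append(Interaction("b", 1, "prompt", "reply"))
induction, heldout = chronological_split(history)
print(len(induction), len(heldout))  # 5 1
```

Splitting by time rather than at random keeps later behaviour out of the induction set, so consistency measured on the held-out part cannot be explained by fitting the same interactions.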

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

Referee: [Abstract] Abstract: The claim that 'Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks' is presented without any quantitative metrics, baselines, error analysis, or statistical tests. This absence makes it impossible to evaluate the strength or reliability of the reported experimental success.

Authors: We agree that the abstract is a high-level summary and does not include specific quantitative details. The body of the manuscript contains the full experimental results with metrics, baselines, error analyses, and statistical tests. We will revise the abstract to incorporate key quantitative highlights from the experiments. revision: yes
Referee: [Abstract] Abstract (self-validation mechanism): The self-validation is described as ensuring consistency with the user's preferences, yet both rubric induction and validation operate on the same raw user histories without reference to held-out interactions, external human judgments, or independent benchmarks. This setup risks circularity, where reported fidelity and generalization may reflect fitting to training patterns rather than capturing stable, transferable preferences (particularly given the discriminative RL objective contrasting user responses against model outputs).

Authors: We acknowledge the risk of circularity when both induction and validation draw from the same user histories. The design intentionally learns from raw interaction data, with the discriminative RL objective providing contrast against competitive model outputs to define user-specific boundaries rather than simple pattern fitting. Cross-user and cross-task generalization experiments provide evidence of stability. We will add an explicit discussion of this limitation and design rationale in the revised manuscript. revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, datasets, or implementation details, so free parameters, axioms, and invented entities cannot be enumerated.

pith-pipeline@v0.9.1-grok · 5803 in / 1109 out tokens · 15619 ms · 2026-06-28T22:21:12.113734+00:00 · methodology

Original abstract

As Large Language Models (LLMs) evolve from general-purpose assistants to user-centric agents, personalization has become central to aligning model behavior with individual preferences, making the evaluation of personalized alignment a critical bottleneck. Existing evaluation methods, ranging from automatic metrics to LLM-as-a-judge approaches, fail to capture subjective, user-specific preferences embedded in long-term interaction histories. We identify three essential principles for reliable and effective personalized evaluation: Representativeness, User-Consistency, and Discriminativeness. To address these principles, we introduce Personalized Evaluation as Learning, a paradigm that formulates personalized evaluation as a learning problem rather than a static judgment. Under this paradigm, we propose PARL (Preference-Aware Rubric Learning for Personalized Evaluation), a framework that learns to induce preference-aware evaluation rubrics directly from raw user histories and performs a self-validation mechanism to ensure consistency with the user's preferences. PARL integrates rubric induction with a discriminative reinforcement learning objective that contrasts user-authored responses against competitive personalized model outputs, enabling the learned rubrics to capture precise, user-specific decision boundaries. Experiments on real-world personalized text generation tasks show that PARL consistently induces high-fidelity rubrics that reliably identify user-aligned responses and generalize across users and tasks, while capturing stable stylistic preferences and fine-grained evaluative patterns. To ensure reproducibility, our code is available at https://github.com/SnowCharmQ/PARL.

Figures

Figures reproduced from arXiv: 2605.31545 by Cilin Yan, Jiayin Cai, Tat-Seng Chua, Xiaolong Jiang, Xiaoyan Zhao, Yang Zhang, Yao Hu, Yilun Qiu, Yoko Yamakata, Yuxin Chen.

**Figure 2.** Results of LLM-as-a-judge evaluation scores across three datasets and various personalized generation methods. Recoverable plot labels: axis "User-level Accuracy"; user coverage PARL-0 100.0%, PARL-A 98.4%, PARL-B 99.4%.

**Figure 4.** Intrinsic analysis of induced user rubrics across five rubric induction variants on three ...

**Figure 5.** Qualitative analysis of induced rubrics on the ...


[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[3] [3]

PAL: sample-efficient personalized reward modeling for pluralistic alignment

Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, and Ramya Korlakai Vinayak. PAL: sample-efficient personalized reward modeling for pluralistic alignment. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a. Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wan...

2025

[4] [4]

PAD: personalized alignment of llms at decoding-time

Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, and Zuozhu Liu. PAD: personalized alignment of llms at decoding-time. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, 2025b. Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, et al. Learning to self-verify makes ...

work page arXiv 2025

[5] [5]

Pref: Reference-free evaluation of personalised text generation in llms.arXiv preprint arXiv:2508.10028,

Xiao Fu, Hossein A Rahmani, Bin Wu, Jerome Ramos, Emine Yilmaz, and Aldo Lipani. Pref: Reference-free evaluation of personalised text generation in llms.arXiv preprint arXiv:2508.10028,

work page arXiv

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025a. 12 Preference-Aware Rubric Learning for Personalized Evaluation Zhiliang Guo, Teng Tu, Yunshan Ma, and Xun Y...

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Llm2rec: Large language models are powerful embedding models for sequential recommendation

Yingzhi He, Xiaohao Liu, An Zhang, Yunshan Ma, and Tat-Seng Chua. Llm2rec: Large language models are powerful embedding models for sequential recommendation. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, V .2, KDD 2025, pp. 896–907. ACM,

2025

[8] [8]

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, and Yushun Dong. Rulers: Locked rubrics and evidence-anchored scoring for robust llm evaluation.arXiv preprint arXiv:2601.08654,

work page internal anchor Pith review Pith/arXiv arXiv

[9]

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation. arXiv preprint arXiv:2403.03952.

[10]

Personalized soups: Personalized large language model alignment via post-hoc parameter merging

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564.

[11]

Tagging the thought: Unlocking personalization reasoning via reinforcement learning

Song Jin, Juntian Zhang, Yong Liu, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, and Rui Yan. Tagging the thought: Unlocking personalization reasoning via reinforcement learning. arXiv preprint arXiv:2509.23140.

[12]

Prometheus: Inducing fine-grained evaluation capability in language models

Seungone Kim, Jamin Shin, Yejin Choi, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.

[13]

Adam: A method for stochastic optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, 2015.

[14]

Longlamp: A benchmark for personalized long-form text generation

Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, et al. Longlamp: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016.

[15]

Learning to rewrite prompts for personalized text generation

Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, and Michael Bendersky. Learning to rewrite prompts for personalized text generation. In Proceedings of the ACM Web Conference 2024, pp. 3367–3378, 2024.

[16]

Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment. arXiv preprint arXiv:2510.07743, 2025a

Jiahong Liu, Wenhao Yu, Quanyu Dai, Zhongyang Li, Jieming Zhu, Menglin Yang, Tat-Seng Chua, and Irwin King. Exploring personalization shifts in representation space of llms. In Knowledgeable Foundation Models at ACL 2025, 2025a.

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and...

[17]

Reasoning meets personalization: Unleashing the potential of large reasoning model for personalized generation

Sichun Luo, Guanzhi Deng, Jian Xu, Xiaojie Zhang, Hanxu Hou, and Linqi Song. Reasoning meets personalization: Unleashing the potential of large reasoning model for personalized generation. arXiv preprint arXiv:2505.17571.

[18]

News category dataset

Rishabh Misra. News category dataset. arXiv preprint arXiv:2209.11429.

[19]

Rubric is all you need: Improving llm-based code evaluation with question-specific rubrics

Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Arnav Ramamoorthy, Pratyush Ghosh, Aaryan Raj Jindal, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, et al. Rubric is all you need: Improving llm-based code evaluation with question-specific rubrics. In Proceedings of the 2025 ACM Conference on International Computing Education Research V.1, pp. 181–195, 2025.

[20]

Personalizing reinforcement learning from human feedback with variational preference learning

Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, and Natasha Jaques. Personalizing reinforcement learning from human feedback with variational preference learning. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, 2024.

[21]

arXiv preprint arXiv:2310.20081

Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, and Fuli Feng. Latent inter-user difference modeling for LLM personalization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10610–10628, 2025a.

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, and Tat-Seng Chua...

[22]

LaMP-QA: A benchmark for personalized long-form question answering

Alireza Salemi and Hamed Zamani. LaMP-QA: A benchmark for personalized long-form question answering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 1139–1159, 2025.

[23]

https://arxiv.org/abs/2501.04167

Alireza Salemi, Julian Killingback, and Hamed Zamani. ExPerT: Effective and explainable evaluation of personalized long-form text generation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 17516–17532, 2025a.

Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, W...

[24]

MiCRo: Mixture modeling and context-aware routing for personalized preference learning

Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, and Han Zhao. MiCRo: Mixture modeling and context-aware routing for personalized preference learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17458–17474, 2025.

[25]

Hybridflow: A flexible and efficient RLHF framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, pp. 1279–1297, 2025.

[26]

Democratizing large language models via personalized parameter-efficient fine-tuning

Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, and Meng Jiang. Democratizing large language models via personalized parameter-efficient fine-tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 6476–6491, 2024.

[27]

Think-while-generating: On-the-fly reasoning for personalized long-form generation

Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, and Tat-Seng Chua. Think-while-generating: On-the-fly reasoning for personalized long-form generation. arXiv preprint arXiv:2512.06690.

[28]

Automated evaluation of personalized text generation using large language models

Yaqing Wang, Jiepu Jiang, Mingyang Zhang, Cheng Li, Yi Liang, Qiaozhu Mei, and Michael Bendersky. Automated evaluation of personalized text generation using large language models. arXiv preprint arXiv:2310.11593.

[29]

A survey on personalized and pluralistic preference alignment in large language models

Zhouhang Xie, Junda Wu, Yiran Shen, Yu Xia, Xintong Li, Aaron Chang, Ryan Rossi, Sachin Kumar, Bodhisattwa Prasad Majumder, Jingbo Shang, et al. A survey on personalized and pluralistic preference alignment in large language models. arXiv preprint arXiv:2504.07070.

[30]

Qwen3 Technical Report

Yiyan Xu, Wenjie Wang, Yang Zhang, Biao Tang, Peng Yan, Fuli Feng, and Xiangnan He. Personalized image generation with large multimodal models. In Proceedings of the ACM on Web Conference 2025, pp. 264–274, 2025a.

Yiyan Xu, Jinghao Zhang, Alireza Salemi, Xinting Hu, Wenjie Wang, Fuli Feng, Hamed Zamani, Xiangnan He, and Tat-Seng Chua. Personalized genera...

[31]

Bartscore: Evaluating generated text as text generation

Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pp. 27263–27277, 2021.

[32]

Prlm: Learning explicit reasoning for personalized rag via contrastive reward optimization

Kepu Zhang, Teng Shi, Weijie Yu, and Jun Xu. Prlm: Learning explicit reasoning for personalized rag via contrastive reward optimization. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 5484–5488, 2025a.

Pinyi Zhang, Ting-En Lin, Yuchuan Wu, ...

[33]

Causality-enhanced behavior sequence modeling in llms for personalized recommendation

Yang Zhang, Juntao You, Yimeng Bai, Jizhi Zhang, Keqin Bao, Wenjie Wang, and Tat-Seng Chua. Causality-enhanced behavior sequence modeling in llms for personalized recommendation. arXiv preprint arXiv:2410.22809, 2024a.

Yang Zhang, Fuli Feng, Jizhi Zhang, Keqin Bao, Qifan Wang, and Xiangnan He. Collm: Integrating collaborative embeddings into large language...

[34]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, ...

2023

[35]

HYDRA: model factorization framework for black-box LLM personalization

Yuchen Zhuang, Haotian Sun, Yue Yu, Rushi Qiang, Qifan Wang, Chao Zhang, and Bo Dai. HYDRA: model factorization framework for black-box LLM personalization. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024.

[36]

A LIMITATIONS

The effectiveness of our rubric generator depends heavily on the quality and quantity of available user behavioral history. In cold-start scenarios, where historical signals are sparse, the model may struggle to induce sufficiently detailed and stable criteria, constraining its ...

[37]

D BASELINE DETAILS

In this section, we provide more detailed descriptions of baselines benchmarked by PARL. To comprehensively evaluate the effectiveness and robustness of our framework, we extend our analysis beyond methods featured in Section 5.1 to include a broader suite of competitive baselines. These methods are categorized into four paradigms: ICL (...

[38]

optimization objective. Training is conducted with a batch size of 8, a maximum prompt length of 10240 tokens, and a maximum response length of 2048 tokens. Optimization is carried out using Adam (Kingma & Ba, 2015) ...
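The training budget quoted in this fragment (batch size 8, prompts up to 10240 tokens, responses up to 2048 tokens, Adam) can be collected into a small configuration object. The following is only a minimal sketch: the class name `TrainConfig` and its field names are hypothetical and do not come from the paper or from any training framework, and the derived bound ignores any multiple rollouts per prompt that an RL objective might add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters stated in the text."""

    batch_size: int = 8            # stated: batch size of 8
    max_prompt_len: int = 10240    # stated: maximum prompt length in tokens
    max_response_len: int = 2048   # stated: maximum response length in tokens
    optimizer: str = "adam"        # stated: Adam (Kingma & Ba)

    @property
    def max_sequence_len(self) -> int:
        # Longest prompt plus longest response for one example.
        return self.max_prompt_len + self.max_response_len

    @property
    def max_tokens_per_batch(self) -> int:
        # Upper bound on tokens in one batch, assuming one response per prompt.
        return self.batch_size * self.max_sequence_len


cfg = TrainConfig()
print(cfg.max_sequence_len, cfg.max_tokens_per_batch)
```

Under these assumptions a single example spans at most 12288 tokens, which gives a rough sense of the context window the backbone model must support.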

[39]

with nucleus sampling (p = 0.95), temperature 0.6, and top-k sampling (k = 20), producing 5 samples per prompt. For rubric generation, we adopt the same vLLM-based deployment configuration. The selected baselines for computing Discriminative Margin Product include: Non, RAG, SFT, GRPO, SFT+GRPO. Ta...
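To make the decoding settings in this fragment concrete (temperature 0.6, top-k with k = 20, nucleus sampling with p = 0.95, 5 samples), here is a toy single-step sketch in plain Python. It is not the paper's vLLM deployment: the function names are hypothetical, the order of the filters (temperature, then top-k, then nucleus over the renormalized top-k mass) is an assumption that mirrors common practice, and it draws 5 next-token samples from one logit vector rather than 5 full responses per prompt.

```python
import math
import random


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Map raw logits to {token_id: prob} after temperature, top-k, and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens and renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Nucleus: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_tokens(logits, n=5, seed=0, **kwargs):
    """Draw n token ids from the filtered distribution (n=5 mirrors the text)."""
    dist = filter_distribution(logits, **kwargs)
    rng = random.Random(seed)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=n)


toy_logits = [5.0, 4.0, 1.0, 0.0, -2.0]
print(filter_distribution(toy_logits))
print(sample_tokens(toy_logits))
```

A temperature below 1 sharpens the distribution before filtering, so with these settings the nucleus often contains only a handful of tokens even though k allows up to 20.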

[40]

Our framework, PARL, consistently ensures that authentic user-authored GT responses achieve the highest rubric-level accuracy across all three personalized text generation tasks. Beyond absolute accuracy, PARL demonstrates robust discriminative power, effectively establishing a clear evaluative margin between GT and competitive baselines. These results conf...

[41]

As shown in Table 16, Table 17, and Table 18, we also provide prompts used in the comparison LLM-as-a-judge experiments in Section ?? for reference.

Table 7: Detailed evaluation results of induced rubrics across three personalized text generation tasks on user-level accuracy. Amazon ReviewLM-8B ...