pith. sign in

arxiv: 2508.10695 · v2 · submitted 2025-08-14 · 💻 cs.CL · cs.AI· cs.IR

Learning from Natural Language Feedback for Personalized Question Answering

Pith reviewed 2026-05-18 22:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.IR
keywords personalized question answeringnatural language feedbacklarge language modelsreinforcement learningpersonalizationLaMP-QAfeedback optimization
0
0 comments X

The pith

Natural language feedback replaces scalar rewards to better personalize question answering models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that natural language feedback generated from user profiles and question context supplies richer, more actionable training signals than scalar rewards for teaching large language models to personalize their answers. A sympathetic reader would care because current retrieval-augmented approaches often produce only weak guidance, limiting how well models adapt to individual users in everyday information-seeking tasks. The authors introduce the VAC framework, which alternates between training a feedback model to produce instructive natural-language comments and fine-tuning a policy model on the improved responses that result. This loop lets the policy model internalize effective personalization strategies so that it no longer needs external feedback once training ends. Experiments on the LaMP-QA benchmark across three domains show consistent gains over prior methods, and human judges rate the resulting answers higher in quality.

Core claim

Natural language feedback serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference and produces consistent improvements over the state-of-the-art on LaMP-QA.

What carries the argument

VAC framework: alternating optimization in which a feedback model generates natural language feedback conditioned on user profiles and question narratives to supervise iterative refinement of the policy model's personalized answers.

If this is right

  • The policy model internalizes personalization strategies and generates high-quality responses without any feedback at inference time.
  • Consistent and significant gains appear over prior state-of-the-art methods across the three domains in the LaMP-QA benchmark.
  • Human evaluations rate the final responses as higher quality than those from scalar-reward approaches.
  • Natural language feedback supplies more instructive optimization signals than scalar rewards alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alternating training loop could be tested on other personalization settings such as recommendation or summarization where user context is available.
  • If the feedback model generalizes across domains, the approach might enable faster adaptation to new users with limited history.
  • Model-generated natural language critiques could reduce dependence on costly human-written feedback in similar alignment pipelines.

Load-bearing premise

The generated natural language feedback is consistently high-quality, unbiased, and more informative than scalar rewards without introducing new inconsistencies or hallucinations that could degrade the policy model during alternating optimization.

What would settle it

If a controlled experiment on LaMP-QA shows that the VAC policy model fails to outperform scalar-reward baselines on automatic metrics or human preference judgments, or if performance drops when feedback quality is degraded, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2508.10695 by Alireza Salemi, Hamed Zamani.

Figure 1
Figure 1. Figure 1: Overview of the training loop and inference process in the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompts used for response generation in RAG [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Prompt used with the feedback model in the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prompt used for PlanPers [30] baseline. the prompts shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance of VAC in different training iterations with trained and untrained feedback model. training. The model is trained for three iterations under the same configuration as our method, with the best checkpoint from all iterations used for evaluation. This comparison enables an empir￾ical assessment of the efficiency and effectiveness of NLF-based optimization versus scalar reward-based optimization. … view at source ↗
Figure 6
Figure 6. Figure 6: Effect of feedback model size on the performance of [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of optimization method for policy model in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: In 44% of the cases, VAC was preferred by the annotators for better addressing the rubric aspects. In 23% of the cases, the PlanPers baseline was favored. The remaining 33% were judged as ties. These demonstrate that VAC produces responses that are more aligned with user-specific rubric aspects from human perspective, indicating its effectiveness in personalized question answering. 5.3 Case Study This sect… view at source ↗
Figure 8
Figure 8. Figure 8: Case study illustrating the initial response, feedback, and updated response (top row), and a comparison between [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Results of human evaluation between VAC and the best performing baseline, PlanPers [30]. recommendations by outlining clear signs of doneness, offering practical monitoring strategies, and providing estimated cooking times based on bone type. Presented in a well-structured format, the revised response better reflects the user’s context and yields a more personalized, informative, and user-aligned response.… view at source ↗
read the original abstract

Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that are generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VAC, a framework for personalized question answering that replaces scalar rewards with natural language feedback (NLF) generated conditioned on user profiles and question narratives. Training alternates between optimizing the feedback model and fine-tuning the policy model on the resulting responses, yielding a policy that requires no feedback at inference time. The approach is evaluated on the LaMP-QA benchmark across three domains and reports consistent improvements over prior state-of-the-art methods, corroborated by human evaluations.

Significance. If the alternating optimization maintains high-quality, unbiased NLF without introducing hallucinations or profile misinterpretations, the method could supply richer supervision than scalar rewards and improve personalization in information-seeking tasks. The LaMP-QA gains and human evaluations would then constitute a meaningful advance over RAG-plus-RL baselines. The result's impact depends on demonstrating that feedback quality does not degrade across iterations.

major comments (2)
  1. [Method] The alternating optimization procedure (described in the method) provides no explicit safeguard—such as regularization, divergence monitoring between successive feedback rounds, or periodic human filtering—against feedback-model drift or hallucination. This is load-bearing for the central claim that NLF supplies strictly richer, actionable supervision than scalar rewards.
  2. [Experiments] No quantitative assessment of NLF quality (e.g., human ratings of feedback accuracy, consistency with user profiles, or hallucination rate) is reported, nor are statistical significance tests or variance estimates supplied for the LaMP-QA improvements. These omissions prevent verification that the observed gains are robust rather than artifacts of unmonitored feedback degradation.
minor comments (1)
  1. [Abstract] The abstract states that LaMP-QA consists of 'three diverse domains' but does not name them; adding this detail would improve clarity for readers unfamiliar with the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below with clarifications from the current work and indicate where revisions will be made to improve the presentation of safeguards and evaluation details.

read point-by-point responses
  1. Referee: [Method] The alternating optimization procedure (described in the method) provides no explicit safeguard—such as regularization, divergence monitoring between successive feedback rounds, or periodic human filtering—against feedback-model drift or hallucination. This is load-bearing for the central claim that NLF supplies strictly richer, actionable supervision than scalar rewards.

    Authors: We agree that safeguards against potential drift or hallucination in the alternating loop are important for the central claim. The VAC method initializes the feedback model from a strong pretrained LLM and generates NLF explicitly conditioned on user profiles and question narratives to promote grounded, profile-consistent feedback. The alternating process allows the policy to internalize effective strategies so that feedback is not needed at inference. While explicit regularization or divergence monitoring was not implemented or reported in the original submission, the human evaluations on final responses show consistent gains over baselines, suggesting the process did not suffer from severe degradation. In the revision we will add a dedicated paragraph in the method section discussing these risks and outlining simple monitoring approaches (e.g., periodic consistency checks) as future safeguards, without retroactively claiming they were applied. revision: partial

  2. Referee: [Experiments] No quantitative assessment of NLF quality (e.g., human ratings of feedback accuracy, consistency with user profiles, or hallucination rate) is reported, nor are statistical significance tests or variance estimates supplied for the LaMP-QA improvements. These omissions prevent verification that the observed gains are robust rather than artifacts of unmonitored feedback degradation.

    Authors: We acknowledge that direct quantitative assessment of NLF quality and statistical details for the automatic metrics were not included. The manuscript reports human evaluations only on the final personalized responses, which provide indirect support for the training signal quality. We will revise the experiments section to add human ratings on a sampled set of NLF instances for accuracy, profile consistency, and hallucination rate. We will also report standard deviations across multiple runs and include statistical significance tests (e.g., paired t-tests) for the LaMP-QA improvements to demonstrate robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: new training procedure evaluated on external benchmark with no self-referential reductions

full rationale

The paper presents VAC as a novel alternating optimization framework that replaces scalar rewards with natural language feedback generated from user profiles and question narratives. It describes iterative refinement of a policy model on an external LaMP-QA benchmark across three domains, with reported gains over prior SOTA and human evaluations. No equations, derivations, or first-principles claims appear that reduce by construction to fitted inputs, self-citations, or renamed patterns. The method is self-contained against the independent benchmark, with no load-bearing self-citation chains or definitional loops in the provided description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that natural language feedback provides superior supervision to scalar rewards for personalization; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Natural language feedback generated from user profiles and question narratives constitutes richer and more actionable supervision than scalar rewards for personalizing LLM outputs.
    This premise underpins the replacement of scalar rewards and the alternating training procedure described in the abstract.

pith-pipeline@v0.9.0 · 5741 in / 1270 out tokens · 29846 ms · 2026-05-18T22:52:13.206401+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs.CL] https://arxiv.org/abs/1611.09268

  2. [2]

    Bowman, Kyunghyun Cho, and Ethan Perez

    Angelica Chen, Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. 2024. Learning from Natural Language Feedback. Transactions on Machine Learning Research (2024). https://openreview.net/forum?id=xo3hI5MwvU

  3. [3]

    Bowman, Kyunghyun Cho, and Ethan Perez

    Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Sh- ern Chan, Samuel R. Bowman, Kyunghyun Cho, and Ethan Perez. 2024. Improving Code Generation by Training with Natural Language Feedback. arXiv:2303.16749 [cs.SE] https://arxiv.org/abs/2303.16749

  4. [4]

    Zhicheng Dou, Ruihua Song, and Ji-Rong Wen. 2007. A large-scale evaluation and analysis of personalized search strategies. In Proceedings of the 16th International Conference on World Wide Web (Banff, Alberta, Canada) (WWW ’07). Association for Computing Machinery, New York, NY, USA, 581–590. doi:10.1145/1242572. 1242651

  5. [5]

    Andrew Fowler, Kurt Partridge, Ciprian Chelba, Xiaojun Bi, Tom Ouyang, and Shumin Zhai. 2015. Effects of Language Modeling and Its Personalization on Touchscreen Typing Performance. In Proceedings of the 33rd Annual ACM Con- ference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI ’15). Association for Computing Machinery, New York, N...

  6. [6]

    Qian Guo, Wei Chen, and Huaiyu Wan. 2021. AOL4PS: A Large-scale Data Set for Personalized Search. Data Intelligence 3, 4 (10 2021), 548–567. arXiv:https://direct.mit.edu/dint/article-pdf/3/4/548/1968580/dint_a_00104.pdf doi:10.1162/dint_a_00104

  7. [7]

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. InInternational Conference on Learning Representations. https://openreview.net/forum?id=rygGQyrFvH

  8. [8]

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations . https: //openreview.net/forum?id=nZeVKeeFYf9

  9. [9]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bo- janowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised Dense Infor- mation Retrieval with Contrastive Learning. Transactions on Machine Learning Research (2022). https://openreview.net/forum?id=jKN1pXi7b0

  10. [10]

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu

  11. [11]

    arXiv:2310.11564

    Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging. arXiv:2310.11564

  12. [12]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Opti- mization. CoRR abs/1412.6980 (2014). https://api.semanticscholar.org/CorpusID: 6628106

  13. [13]

    Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani

    Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Chien Van Nguyen, Thien Huu Nguyen, and Hamed Zamani. 2024. LongLaMP: A Benchmark for Personalized Long-form Text Generation. arXiv:2407.11016 [cs.CL] https://arxiv.org/abs/2407.11016

  14. [14]

    Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, and Michael Ben- dersky. 2024. Learning to Rewrite Prompts for Personalized Text Genera- tion. In Proceedings of the ACM on Web Conference 2024 (WWW ’24) . ACM. doi:10.1145/3589334.3645408

  15. [15]

    Cheng Li, Mingyang Zhang, Qiaozhu Mei, Yaqing Wang, Spurthi Amba Hombaiah, Yi Liang, and Michael Bendersky. 2023. Teach LLMs to Personalize – An Approach inspired by Writing Education. arXiv:2308.07968

  16. [16]

    Alphaone: Reasoning models thinking slow and fast at test time

    Yanyang Li, Michael R. Lyu, and Liwei Wang. 2025. Learning to Reason from Feedback at Test-Time. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, ...

  17. [17]

    Jiongnan Liu, Zhicheng Dou, Guoyu Tang, and Sulong Xu. 2023. JDsearch: A Personalized Product Search Dataset with Real Queries and Full Interactions. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2945–2...

  18. [18]

    Zhuoran Lu, Sheshera Mysore, Tara Safavi, Jennifer Neville, Longqi Yang, and Mengting Wan. 2024. Corporate Communication Companion (CCC): An LLM- empowered Writing Assistant for Workplace Social Media. arXiv:2405.04656

  19. [19]

    Hanjia Lyu, Song Jiang, Hanqing Zeng, Yinglong Xia, Qifan Wang, Si Zhang, Ren Chen, Chris Leung, Jiajie Tang, and Jiebo Luo. 2024. LLM-Rec: Personalized Recommendation via Prompting Large Language Models. In Findings of the Asso- ciation for Computational Linguistics: NAACL 2024 , Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computa...

  20. [20]

    Chunyan Mao, Shuaishuai Huang, Mingxiu Sui, Haowei Yang, and Xueshe Wang

  21. [21]

    arXiv:2410.09923 [cs.IR] https://arxiv.org/abs/ 2410.09923

    Analysis and Design of a Personalized Recommendation System Based on a Dynamic User Interest Model. arXiv:2410.09923 [cs.IR] https://arxiv.org/abs/ 2410.09923

  22. [22]

    Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, and Tara Safavi

  23. [23]

    arXiv:2311.09180

    PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers. arXiv:2311.09180

  24. [24]

    Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherni- avskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kon- dratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao,...

  25. [25]

    Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2024. REFINER: Reasoning Feedback on Intermedi- ate Representations. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , Yvette Graham and Matthew Purver (Eds.)...

  26. [26]

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  27. [27]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Thirty-seventh Conference on Neural Infor- mation Processing Systems. https://openreview.net/forum?id=HPuSIXJaa9

  28. [28]

    Alireza Salemi, Surya Kallumadi, and Hamed Zamani. 2024. Optimization Methods for Personalizing Large Language Models through Retrieval Augmen- tation. In Proceedings of the 47th International ACM SIGIR Conference on Re- search and Development in Information Retrieval (Washington DC, USA) (SI- GIR ’24). Association for Computing Machinery, New York, NY, U...

  29. [29]

    Alireza Salemi, Julian Killingback, and Hamed Zamani. 2025. ExPerT: Effec- tive and Explainable Evaluation of Personalized Long-Form Text Generation. In Findings of the Association for Computational Linguistics: ACL 2025 , Wanxi- ang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienn...

  30. [30]

    Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Tao Chen, Zhuowan Li, Michael Bendersky, and Hamed Zamani. 2025. Reasoning-Enhanced Self-Training for Long-Form Personalized Text Genera- tion. arXiv:2501.04167 [cs.CL] https://arxiv.org/abs/2501.04167

  31. [31]

    Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. LaMP: When Large Language Models Meet Personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thai...

  32. [32]

    Alireza Salemi and Hamed Zamani. 2025. Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (Padua, Italy) (ICTIR ’25). Association for Computing...

  33. [33]

    Alireza Salemi and Hamed Zamani. 2025. LaMP-QA: A Benchmark for Personal- ized Long-form Question Answering. In Proceedings of the The 2025 Conference Learning from Natural Language Feedback for Personalized Question Answering Preprint, on Empirical Methods in Natural Language Processing (to appear) . Association for Computational Linguistics, Suzhou, China

  34. [34]

    Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron T Parisi, Abhishek Kumar, Alexander A Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Fathy Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pen...

  35. [35]

    Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. InProceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada) (NIPS’14). MIT Press, Cambridge, MA, USA, 3104–3112

  36. [36]

    Zhaoxuan Tan, Zheyuan Liu, and Meng Jiang. 2024. Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts. arXiv:2406.10471

  37. [37]

    Zhiheng Xi, Dingwen Yang, Jixuan Huang, Jiafu Tang, Guanyu Li, Yiwen Ding, Wei He, Boyang Hong, Shihan Do, Wenyu Zhan, Xiao Wang, Rui Zheng, Tao Ji, Xiaowei Shi, Yitao Zhai, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Zuxuan Wu, Qi Zhang, Xipeng Qiu, Xuanjing Huang, and Yu-Gang Jiang. 2024. Enhancing LLM Reasoning via Critique Models with Test-Ti...

  38. [38]

    Yiyan Xu, Jinghao Zhang, Alireza Salemi, Xinting Hu, Wenjie Wang, Fuli Feng, Hamed Zamani, Xiangnan He, and Tat-Seng Chua. 2025. Personalized Generation In Large Model Era: A Survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Moham...

  39. [39]

    Gui-Rong Xue, Jie Han, Yong Yu, and Qiang Yang. 2009. User Language Model for Collaborative Personalized Search. ACM Trans. Inf. Syst. 27, 2, Article 11 (mar 2009), 28 pages. doi:10.1145/1462198.1462203

  40. [40]

    Wang, Wen-tau Yih, and Ziyu Yao

    Hao Yan, Saurabh Srivastava, Yintao Tai, Sida I. Wang, Wen-tau Yih, and Ziyu Yao. 2023. Learning to Simulate Natural Language Feedback for Interactive Semantic Parsing. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , Anna Rogers, Jordan Boyd- Graber, and Naoaki Okazaki (Eds.). Associatio...

  41. [41]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. 2024. Large Language Models as Optimizers. In The Twelfth International Conference on Learning Representations . https://openreview.net/ forum?id=Bb4VGOWELI

  42. [42]

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. 2025. Optimizing generative AI by backpropa- gating language model feedback. Nature 639 (2025), 609–616

  43. [43]

    Kai Zhang, Yangyang Kang, Fubang Zhao, and Xiaozhong Liu. 2024. LLM- based Medical Assistant Personalization with Short- and Long-Term Memory Coordination. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and S...