Tuning Language Models for Robust Prediction of Diverse User Behaviors

Chen Yang; Fanjin Meng; Haisheng Lu; Hong Chen; Jiahui Gong; Jingtao Ding; Yong Li; Zuojian Wang

arxiv: 2505.17682 · v2 · submitted 2025-05-23 · 💻 cs.CL · cs.AI

Fanjin Meng, Jingtao Ding, Jiahui Gong, Chen Yang, Hong Chen, Zuojian Wang, Haisheng Lu, Yong Li

Pith reviewed 2026-05-19 13:43 UTC · model grok-4.3

keywords: user behavior prediction · large language models · fine-tuning · long-tailed behaviors · anchor behaviors · tail behaviors · few-shot learning

The pith

BehaviorLM's two-stage fine-tuning lets LLMs predict both common and rare user behaviors without overfitting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard fine-tuning of large language models for user behavior prediction causes them to overfit to frequent anchor behaviors and lose ground on uncommon tail behaviors. BehaviorLM counters this with a first stage that tunes only on anchors while keeping general knowledge intact, then a second stage that retrains on a difficulty-balanced mix of all behaviors. This setup is meant to let the models draw on their pre-trained behavioral knowledge so they can handle tail behaviors well even with few examples. A reader would care because real-world assistants need to serve the full range of user actions, not just the popular ones.

Core claim

BehaviorLM is a progressive fine-tuning method in which LLMs are first tuned on anchor behaviors to retain general behavioral knowledge and then tuned on a balanced subset of behaviors chosen according to sample difficulty; this process yields robust prediction of both anchor and tail behaviors and allows effective few-shot mastery of tail behaviors by leveraging the LLM's pre-trained knowledge, as shown on two real-world datasets.
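
As a minimal sketch of the first step of that schedule, the Python below partitions behaviors into anchor and tail sets by relative frequency and collects the stage-one training indices. The 1% threshold follows the Figure 1 caption; the function names, the toy behavior names, and the counts are hypothetical and not taken from the paper.

```python
from collections import Counter


def split_anchor_tail(labels, threshold=0.01):
    """Partition behaviors into anchor and tail sets by relative frequency.

    A behavior is an anchor when it accounts for more than `threshold`
    of all samples; everything else is a tail behavior.
    """
    counts = Counter(labels)
    total = len(labels)
    anchors = {b for b, c in counts.items() if c / total > threshold}
    tails = set(counts) - anchors
    return anchors, tails


def stage_one_indices(labels, anchors):
    """Indices of the samples used in stage 1 (anchor behaviors only)."""
    return [i for i, b in enumerate(labels) if b in anchors]


# Hypothetical toy log: two frequent behaviors and one rare behavior.
labels = ["open_app"] * 600 + ["send_message"] * 395 + ["set_alarm"] * 5
anchors, tails = split_anchor_tail(labels)
stage1 = stage_one_indices(labels, anchors)
```

In this toy log `set_alarm` makes up 0.5% of the samples, so it falls in the tail set and is held out of stage one.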

What carries the argument

BehaviorLM, the two-stage progressive fine-tuning process that first anchors on frequent behaviors and then balances the full set by sample difficulty.

If this is right

Anchor behavior predictions stay strong while tail predictions improve.
LLMs can master tail behaviors with few-shot examples by drawing on pre-trained knowledge.
The method applies to real-world user behavior datasets without sacrificing common-case performance.
Difficulty-based balancing avoids the overfitting typical of standard fine-tuning on imbalanced data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged balancing could extend to other long-tailed tasks such as rare-event detection in recommendation systems.
If difficulty selection proves robust, it might reduce reliance on manually curated balanced training sets for domain adaptation of LLMs.
Testing the approach on additional datasets with different tail distributions would show whether the gains hold beyond the reported cases.

Load-bearing premise

Selecting a balanced subset by sample difficulty improves tail predictions without harming anchor performance or creating new overfitting or selection problems.
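
To make the premise concrete, here is one way a difficulty-balanced subset could be built. It is a sketch under stated assumptions, not the authors' procedure: the abstract does not define difficulty, so the example accepts any per-sample score (a loss value, say) and assumes the hardest samples of each behavior are kept. The function name and the `per_class` parameter are hypothetical.

```python
from collections import defaultdict


def difficulty_balanced_subset(labels, difficulty, per_class):
    """Keep up to `per_class` samples of every behavior, hardest first.

    `difficulty[i]` is an assumed per-sample score where larger means harder.
    Capping each behavior at the same count is what balances the subset.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = []
    for idxs in by_class.values():
        hardest = sorted(idxs, key=lambda i: difficulty[i], reverse=True)
        selected.extend(hardest[:per_class])
    return sorted(selected)


# Toy data: behavior "a" is frequent, "b" and "c" are rare.
labels = ["a", "a", "a", "a", "b", "b", "c"]
difficulty = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5]
subset = difficulty_balanced_subset(labels, difficulty, per_class=2)
```

Whether the selection should favor hard, easy, or mixed samples is exactly the kind of detail the premise depends on, so the sort direction here should be read as a placeholder.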

What would settle it

If experiments on the two real-world datasets show that tail behavior accuracy fails to improve or anchor accuracy drops after the balanced second stage, the benefit of the progressive approach would be disproved.

Figures

Figures reproduced from arXiv: 2505.17682 by Chen Yang, Fanjin Meng, Haisheng Lu, Hong Chen, Jiahui Gong, Jingtao Ding, Yong Li, Zuojian Wang.

**Figure 1.** (a) Empirical distribution of user behaviors in the Behavior dataset: "Anchor Behaviors" occur more than 1% of the time, while …

**Figure 2.** The BehaviorLM framework, with a progressive fine-tuning approach

**Figure 3.** The effect of behavioral knowledge under different model size (1.5B, 8B, 70B), in terms of performance robustness across …

**Figure 4.** Comparison between BehaviorLM and a non-LLM transformer-based method under different sizes of training data.

**Figure 5.** Performance comparison between fine-tuning on all behaviors, anchor behaviors and tail behaviors.

**Figure 6.** Detailed statistics of the Behavior Dataset. The behaviors are categorized into (a) high, (b) medium, and (c) low frequencies

Abstract

Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent "anchor" behaviors, reducing their ability to predict less common "tail" behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BehaviorLM's two-stage fine-tuning with difficulty balancing is a reasonable, practical idea, but the abstract reports almost no experimental controls or numbers, so the central claim is hard to evaluate for now.

The paper's core move is a progressive fine-tuning schedule: first train on frequent anchor behaviors to keep general knowledge, then switch to a balanced subset chosen by sample difficulty to lift tail behavior prediction. That specific combination does not appear in the prior work they cite, so the technique itself is new. It targets a real pain point in user modeling for assistants and recommenders, where long-tailed actions matter for actual utility, and the few-shot angle for tails is worth testing. The abstract claims solid results on two real-world datasets without hurting anchor performance, which would be useful if it holds.

The main soft spot is exactly the one the stress-test flags. If difficulty is measured from the stage-one model (loss or entropy), then tail items will naturally score higher, so the balancing step may just re-weight already-hard examples rather than truly equalizing the distribution. Without a control like a random balanced subset or frequency-stratified selection, any stability on anchors could be an artifact. The abstract also omits all metrics, baselines, error bars, and the exact definition of difficulty or how the subset is built, which makes the positive claim rest on unshown details. The work is empirical and avoids circular fitting, so that part is clean.

This is for researchers doing LLM adaptation in behavioral prediction or recsys; a reader already working on long-tail issues could pick up the progressive schedule as a starting point. It deserves a serious referee because the problem is practical and the method is concrete enough to check, even if the current write-up needs more controls and numbers. I would send it out for review rather than desk reject, but with a clear request for the missing experimental details and an ablation on the balancing procedure.

Referee Report

2 major / 2 minor

Summary. The paper introduces BehaviorLM, a two-stage progressive fine-tuning method for LLMs to predict user behaviors in long-tailed distributions. Stage 1 fine-tunes on frequent anchor behaviors while preserving general knowledge; Stage 2 fine-tunes on a balanced subset of behaviors selected by sample difficulty to boost tail behavior prediction without degrading anchor performance. The authors claim that experiments on two real-world datasets show robust prediction of both anchor and tail behaviors, along with effective leveraging of LLM behavioral knowledge for few-shot tail prediction.

Significance. If the central claims hold under rigorous controls, the work could meaningfully advance LLM adaptation for imbalanced behavioral prediction tasks relevant to intelligent assistants. It offers a practical progressive schedule that attempts to retain pre-trained knowledge while addressing overfitting to frequent behaviors, with potential for broader application in long-tailed NLP settings.

major comments (2)

[§3.2] Stage 2 fine-tuning and balanced subset construction: The description of how sample difficulty is computed is insufficient to establish independence from the anchor/tail distinction. If difficulty derives from stage-1 model loss or entropy, it will be systematically higher for tail items by construction, so the balancing step risks simply re-weighting toward already-hard examples rather than equalizing the distribution. This directly threatens the load-bearing claim that the schedule improves tail performance while leaving anchor performance intact. Explicit ablations (random balanced subset, frequency-stratified subset, or difficulty from a held-out model) are required.
[§4] Experimental results: No quantitative metrics, baselines, error bars, or details on balanced-subset construction and difficulty scoring appear in the reported results. Without these, the positive claims on two datasets cannot be evaluated for robustness or compared to standard fine-tuning or other long-tail methods.
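
The confound raised in the first major comment can be checked with a small diagnostic, alongside the random control the referee asks for. The sketch below assumes per-sample difficulty scores are already available (for instance from a stage-one model); both function names and all toy values are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict
from statistics import mean


def mean_difficulty_by_group(labels, difficulty, anchors):
    """Average difficulty for anchor and tail samples.

    A large gap suggests the score tracks frequency, so a difficulty-based
    selection would partly restate the anchor/tail split.
    """
    groups = {"anchor": [], "tail": []}
    for label, score in zip(labels, difficulty):
        groups["anchor" if label in anchors else "tail"].append(score)
    return {name: mean(scores) for name, scores in groups.items() if scores}


def random_balanced_subset(labels, per_class, seed=0):
    """Control condition: same per-behavior cap, but samples drawn at random."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = []
    for idxs in by_class.values():
        selected.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(selected)


# Toy data: six samples of an anchor behavior "a", two of a tail behavior "t".
labels = ["a"] * 6 + ["t"] * 2
difficulty = [0.2, 0.1, 0.3, 0.2, 0.1, 0.3, 0.9, 0.7]
gap = mean_difficulty_by_group(labels, difficulty, anchors={"a"})
control = random_balanced_subset(labels, per_class=2)
```

Training stage two once on the difficulty-based subset and once on `control`, with everything else fixed, would separate the effect of balancing from the effect of difficulty selection.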

minor comments (2)

[Abstract] The summary of results would be strengthened by including at least one key quantitative finding (e.g., accuracy delta on tail behaviors) rather than qualitative statements alone.
[§2] Notation: The terms 'anchor' and 'tail' behaviors are used without an explicit frequency threshold or percentile definition; adding this in §2 would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where appropriate.

Referee: [§3.2] Stage 2 fine-tuning and balanced subset construction: The description of how sample difficulty is computed is insufficient to establish independence from the anchor/tail distinction. If difficulty derives from stage-1 model loss or entropy, it will be systematically higher for tail items by construction, so the balancing step risks simply re-weighting toward already-hard examples rather than equalizing the distribution. This directly threatens the load-bearing claim that the schedule improves tail performance while leaving anchor performance intact. Explicit ablations (random balanced subset, frequency-stratified subset, or difficulty from a held-out model) are required.

Authors: We agree that the current description of sample difficulty computation in §3.2 requires clarification to demonstrate independence from the anchor/tail distinction. In the revised manuscript, we will expand this section with a precise, formal definition of the difficulty metric and its computation procedure. To directly address the potential bias concern, we will also add the requested ablations: (1) a random balanced subset, (2) a frequency-stratified subset, and (3) difficulty scores computed from a held-out model. These additional experiments will allow us to show whether the observed gains in tail performance stem from the progressive schedule itself rather than from inadvertently selecting harder examples.
Revision: yes

Referee: [§4] Experimental results: No quantitative metrics, baselines, error bars, or details on balanced-subset construction and difficulty scoring appear in the reported results. Without these, the positive claims on two datasets cannot be evaluated for robustness or compared to standard fine-tuning or other long-tail methods.

Authors: We acknowledge that the experimental results section would benefit from more complete quantitative reporting. In the revised manuscript, we will augment §4 with explicit quantitative metrics (including accuracy, macro-F1, and per-category performance for anchor versus tail behaviors), comparisons to standard fine-tuning as well as other long-tail adaptation baselines, error bars computed over multiple random seeds, and expanded details on balanced-subset construction and the difficulty scoring method. These additions will enable readers to assess robustness and facilitate direct comparisons.
Revision: yes
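
The reporting promised in this response could take roughly the following shape. This is a dependency-free sketch, not the authors' evaluation code: the helper names are hypothetical, and every number in the example is a placeholder rather than a result from the paper.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-behavior F1, so rare behaviors count equally."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def group_accuracy(y_true, y_pred, anchors):
    """Accuracy reported separately for anchor and tail behaviors."""
    hits = {"anchor": [], "tail": []}
    for t, p in zip(y_true, y_pred):
        hits["anchor" if t in anchors else "tail"].append(t == p)
    return {g: sum(v) / len(v) for g, v in hits.items() if v}


def summarize_over_seeds(values):
    """Mean and sample standard deviation across runs (needs >= 2 runs)."""
    return mean(values), stdev(values)


# Placeholder predictions for one run: "a" is an anchor, "t" is a tail.
y_true = ["a", "a", "a", "t"]
y_pred = ["a", "a", "t", "t"]
f1 = macro_f1(y_true, y_pred)
acc = group_accuracy(y_true, y_pred, anchors={"a"})

# Placeholder macro-F1 values from three hypothetical seeds.
seed_mean, seed_std = summarize_over_seeds([0.70, 0.74, 0.72])
```

Macro-F1 is the natural headline number here because plain accuracy on a long-tailed test set is dominated by anchor behaviors.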

Circularity Check

0 steps flagged

No circularity: empirical two-stage method validated externally

The paper describes a procedural fine-tuning schedule (stage 1 on anchors, stage 2 on difficulty-balanced subset) and reports experimental results on two real-world datasets. No equations, predictions, or uniqueness claims reduce by construction to fitted parameters or self-citations. The central claims rest on observed performance metrics rather than internal redefinitions or self-referential derivations. This is a standard empirical ML paper whose validity is testable against held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pretrained LLMs already encode useful behavioral knowledge and that sample difficulty provides a reliable signal for balancing without introducing bias. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption LLMs pretrained on vast corpora contain rich behavioral knowledge that can be preserved during fine-tuning on anchor behaviors
Explicitly invoked in the abstract as the reason LLMs offer promise for this task.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We divide the instruction fine-tuning data Dins into two parts: Da_ins ... and Dt_ins ... multi-task fine-tuning approach ... auxiliary conversation dataset Cins"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Hervé Abdi and Lynne J Williams. 2010. Principal component analysis.Wiley interdisciplinary reviews: computational statistics2, 4 (2010), 433–459. Manuscript submitted to ACM 22 Fanjin Meng et al

work page 2010

[2] [2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. InProceedings of the 17th ACM Conference on Recommender Systems. 1007–1014

work page 2023

[4] [4]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. InProceedings of the 26th annual international conference on machine learning. 41–48

work page 2009

[5] [5]

Tom B Brown. 2020. Language models are few-shot learners.arXiv preprint arXiv:2005.14165(2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[6] [6]

Hyunji Chung and Sangjin Lee. 2018. Intelligent virtual assistant knows your life.arXiv preprint arXiv:1803.00466(2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Shijie Geng, Zuohui Fu, Juntao Tan, Yingqiang Ge, Gerard De Melo, and Yongfeng Zhang. 2022. Path language modeling over knowledge graphsfor explainable recommendation. InProceedings of the ACM Web Conference 2022. 946–955

work page 2022

[9] [9]

Jiahui Gong, Jingtao Ding, Fanjin Meng, Guilong Chen, Hong Chen, Shen Zhao, Haisheng Lu, and Yong Li. 2024. A Population-to-individual Tuning Framework for Adapting Pretrained LM to On-device User Intent Prediction. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 896–907

work page 2024

[10] [10]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[11] [11]

Pedram Hosseini, Jessica M Sin, Bing Ren, Bryceton G Thomas, Elnaz Nouri, Ali Farahanchi, and Saeed Hassanpour. 2024. A benchmark for long-form medical question answering.arXiv preprint arXiv:2411.09834(2024)

[12]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)

[13]

Guoqing Hu, An Zhang, Shuo Liu, Zhibo Cai, Xun Yang, and Xiang Wang. 2025. AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings. arXiv preprint arXiv:2504.19218 (2025)

[14]

Jun Hu, Wenwen Xia, Xiaolu Zhang, Chilin Fu, Weichang Wu, Zhaoxin Huan, Ang Li, Zuoli Tang, and Jun Zhou. 2024. Enhancing sequential recommendation via llm-based semantic embedding learning. In Companion Proceedings of the ACM Web Conference 2024. 103–111

[15]

Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206

[16]

Sein Kim, Hongseok Kang, Seungyoon Choi, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2024. Large language models meet collaborative filtering: an efficient all-round LLM-based recommender system. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1395–1406

[17]

Yuxuan Lei, Jianxun Lian, Jing Yao, Xu Huang, Defu Lian, and Xing Xie. 2024. RecExplainer: Aligning Large Language Models for Explaining Recommendation Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1530–1541

[18]

Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023. Text is all you need: Learning language representations for sequential recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1258–1267

[19]

Tong Li, Tong Xia, Huandong Wang, Zhen Tu, Sasu Tarkoma, Zhu Han, and Pan Hui. 2022. Smartphone app usage analysis: datasets, methods, and applications. IEEE Communications Surveys & Tutorials 24, 2 (2022), 937–966

[20]

Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2024. Llara: Large language-recommendation assistant. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1785–1795

[21]

Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2024. Bridging Items and Language: A Transition Paradigm for Large Language Model-Based Recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1816–1826

[22]

Qidong Liu, Xian Wu, Yejing Wang, Zijian Zhang, Feng Tian, Yefeng Zheng, and Xiangyu Zhao. 2024. LLM-ESR: Large Language Models Enhancement for Long-tailed Sequential Recommendation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

[23]

Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2021. Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering 35, 1 (2021), 857–876

[24]

Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2537–2546

[25]

Wenyu Mao, Jiancan Wu, Weijian Chen, Chongming Gao, Xiang Wang, and Xiangnan He. 2025. Reinforced prompt personalization for recommendation with large language models. ACM Transactions on Information Systems 43, 3 (2025), 1–27

[26]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744

[27]

Chaoyi Pu, Zhiang Wu, Hui Chen, Kai Xu, and Jie Cao. 2018. A Sequential Recommendation for Mobile Apps: What Will User Click Next App? In 2018 IEEE International Conference on Web Services (ICWS). 243–248. doi:10.1109/ICWS.2018.00038

[28]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.

[29]

Barbara Rychalska, Szymon Lukasik, and Jacek Dabrowski. 2023. Synerise Monad: A Foundation Model for Behavioral Event Data. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3344–3348

[30]

Germans Savcisens, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, and Sune Lehmann. 2023. Using sequences of life-events to predict human lives. Nature Computational Science (2023), 1–14

[31]

Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature 623, 7987 (2023), 493–498

[32]

Chenyang Shao, Fengli Xu, Bingbing Fan, Jingtao Ding, Yuan Yuan, Meng Wang, and Yong Li. 2024. Beyond Imitation: Generating Human Mobility from Context-aware Reasoning with Large Language Models. arXiv preprint arXiv:2402.09836 (2024)

[33]

Jiang-Xin Shi, Tong Wei, Zhi Zhou, Jie-Jing Shao, Xin-Yan Han, and Yu-Feng Li. 2024. Long-Tail Learning with Foundation Model: Heavy Fine-Tuning Hurts. In Forty-first International Conference on Machine Learning

[34]

Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450

[35]

Amrita S Tulshan and Sudhir Namdeorao Dhage. 2019. Survey on virtual assistant: Google assistant, siri, cortana, alexa. In Advances in Signal Processing and Intelligent Recognition Systems: 4th International Symposium SIRS 2018, Bangalore, India, September 19–22, 2018, Revised Selected Papers

[36]

Junda Wu, Cheng-Chun Chang, Tong Yu, Zhankui He, Jianing Wang, Yupeng Hou, and Julian McAuley. 2024. Coral: collaborative retrieval-augmented large language models improve long-tail recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3391–3401

[37]

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2024. A survey on large language models for recommendation. World Wide Web 27, 5 (2024), 60

[38]

Yu Xia, Rui Zhong, Hao Gu, Wei Yang, Chi Lu, Peng Jiang, and Kun Gai. 2025. Hierarchical tree search-based user lifelong behavior modeling on large language model. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1758–1767

[39]

Haoran Xin, Ying Sun, Chao Wang, and Hui Xiong. 2025. Llmcdsr: Enhancing cross-domain sequential recommendation with large language models. ACM Transactions on Information Systems (2025)

[40]

Dingqi Yang, Daqing Zhang, Vincent W Zheng, and Zhiyong Yu. 2014. Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45, 1 (2014), 129–142

[41]

Hongtao Zhang and Lingcheng Dai. 2018. Mobility prediction: A survey on state-of-the-art schemes and future applications. IEEE access 7 (2018), 802–822

[42]

Junjie Zhang, Ruobing Xie, Yupeng Hou, Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2025. Recommendation as instruction following: A large language model empowered recommendation approach. ACM Transactions on Information Systems 43, 5 (2025), 1–37

[43]

Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM computing surveys (CSUR) 52, 1 (2019), 1–38

[44]

Yang Zhang, Fuli Feng, Jizhi Zhang, Keqin Bao, Qifan Wang, and Xiangnan He. 2025. CoLLM: Integrating Collaborative Embeddings Into Large Language Models for Recommendation. IEEE Transactions on Knowledge & Data Engineering 01 (2025), 1–12

[45]

Yi Zhang, Yiwen Zhang, Yu Wang, Tong Chen, and Hongzhi Yin. 2025. ProEx: A Unified Framework Leveraging Large Language Model with Profile Extrapolation for Recommendation. arXiv preprint arXiv:2512.00679 (2025)

[46]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)

[47]

Zihuai Zhao, Wenqi Fan, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Zhen Wen, Fei Wang, Xiangyu Zhao, Jiliang Tang, et al. 2024. Recommender systems in the era of large language models (llms). IEEE Transactions on Knowledge and Data Engineering (2024)

[48]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL]

[49]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...
