ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
ReRec uses reinforcement fine-tuning with graph-based rewards and step-wise penalties to strengthen multi-step reasoning in LLM recommendation assistants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReRec is a reinforcement fine-tuning framework that augments LLMs for recommendation through three components: Dual-Graph Enhanced Reward Shaping, which fuses NDCG@K with Query Alignment and Preference Alignment Scores; Reasoning-aware Advantage Estimation, which decomposes outputs into segments and penalizes flawed steps; and an Online Curriculum Scheduler, which assesses query difficulty to stabilize training. The result is an LLM that delivers more accurate, reasoning-driven recommendations on complex tasks.
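The Online Curriculum Scheduler is only named, not specified, in the text above. A minimal sketch, assuming difficulty is estimated online as an exponential moving average of observed reward and that early training draws from the easiest queries (the EMA decay, the difficulty definition, and the pool-growth rule are all assumptions of this illustration):

```python
import random

class OnlineCurriculumScheduler:
    """Hypothetical sketch: rank queries by an online difficulty estimate
    (here, an EMA of 1 - reward) and admit progressively harder queries
    as training proceeds. Not the paper's actual scheduler."""

    def __init__(self, query_ids, ema_decay=0.9):
        self.difficulty = {q: 0.5 for q in query_ids}  # 0 = easy, 1 = hard
        self.ema_decay = ema_decay

    def update(self, query_id, reward):
        # Low observed reward -> the query looks harder; track via EMA.
        d = self.difficulty[query_id]
        self.difficulty[query_id] = self.ema_decay * d + (1 - self.ema_decay) * (1.0 - reward)

    def next_batch(self, batch_size, progress):
        # `progress` in [0, 1]: early training samples from the easiest
        # queries; later training opens the pool toward the hardest.
        ranked = sorted(self.difficulty, key=self.difficulty.get)
        pool_size = max(batch_size, int(len(ranked) * progress))
        return random.sample(ranked[:pool_size], min(batch_size, pool_size))
```

The EMA keeps the difficulty estimate cheap to maintain per query, which is what makes the "online" part of the scheduling plausible at RFT scale.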
What carries the argument
Dual-Graph Enhanced Reward Shaping paired with Reasoning-aware Advantage Estimation inside the reinforcement fine-tuning loop, supplying fine-grained, step-level feedback that guides the model toward correct reasoning chains for recommendations.
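The fusion of NDCG@K with the two alignment scores can be sketched as a weighted combination; the weights, the binary-relevance NDCG, and the assumption that both alignment scores lie in [0, 1] are illustrative choices, since the paper's exact combination rule is not given above:

```python
import math

def ndcg_at_k(ranked_item_ids, relevant_item_ids, k):
    """Standard NDCG@K with binary relevance."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked_item_ids[:k])
        if item in relevant_item_ids
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_item_ids))))
    return dcg / ideal if ideal > 0 else 0.0

def shaped_reward(ranked_item_ids, relevant_item_ids, k,
                  query_alignment, preference_alignment,
                  weights=(0.6, 0.2, 0.2)):
    """Hypothetical fusion: weighted sum of NDCG@K and the two
    graph-derived alignment scores. A sketch, not the paper's rule."""
    w_ndcg, w_query, w_pref = weights
    return (w_ndcg * ndcg_at_k(ranked_item_ids, relevant_item_ids, k)
            + w_query * query_alignment
            + w_pref * preference_alignment)
```

Whatever the real combination is, the design intent is the same: the ranking metric alone is too coarse a signal, so the alignment terms supply gradient-worthy feedback even when NDCG@K is flat across rollouts.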
If this is right
- ReRec-trained models outperform prior LLM-based and traditional recommendation baselines on standard metrics.
- The approach maintains the model's original instruction-following and general-knowledge performance after fine-tuning.
- Training stability improves through dynamic difficulty assessment and curriculum ordering.
- The framework produces recommendations accompanied by explicit, verifiable reasoning traces.
Where Pith is reading between the lines
- Similar step-wise advantage estimation could be applied to other LLM tasks that require chained inference, such as multi-hop question answering or tool-use planning.
- The dual-graph construction for rewards may generalize to any domain where both accuracy and alignment with user intent matter.
- If the reasoning gains hold under distribution shift, ReRec-style training could reduce the need for heavy prompt engineering in production recommenders.
Load-bearing premise
The reward signals and advantage estimates actually teach transferable reasoning rather than just optimizing the model to the exact training rewards.
What would settle it
A controlled test in which ReRec models show no improvement over standard fine-tuned LLMs on a held-out set of queries that demand novel multi-step reasoning chains not seen in training would falsify the central claim.
Original abstract
With the rise of LLMs, there is an increasing need for intelligent recommendation assistants that can handle complex queries and provide personalized, reasoning-driven recommendations. LLM-based recommenders show potential but face challenges in multi-step reasoning, underscoring the need for reasoning-augmented systems. To address this gap, we propose ReRec, a novel reinforcement fine-tuning (RFT) framework designed to improve LLM reasoning in complex recommendation tasks. Our framework introduces three key components: (1) Dual-Graph Enhanced Reward Shaping, integrating recommendation metrics like NDCG@K with Query Alignment and Preference Alignment Scores to provide fine-grained reward signals for LLM optimization; (2) Reasoning-aware Advantage Estimation, which decomposes LLM outputs into reasoning segments and penalizes incorrect steps to enhance reasoning of recommendation; and (3) Online Curriculum Scheduler, dynamically assess query difficulty and organize training curriculum to ensure stable learning during RFT. Experiments demonstrate that ReRec outperforms state-of-the-art baselines and preserves core abilities like instruction-following and general knowledge. Our codes are available at https://github.com/jiani-huang/ReRec.
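The per-step penalty mechanism in component (2) admits a simple group-relative sketch. The mean-of-group baseline, the fixed penalty value, and the existence of a per-segment correctness judge are assumptions of this illustration, not details from the paper:

```python
def segment_advantages(rewards, segment_flags, penalty=0.5):
    """Hypothetical sketch of reasoning-aware advantage estimation.

    rewards: one scalar reward per sampled response in a rollout group.
    segment_flags: per response, booleans marking whether each reasoning
    segment was judged correct.

    A group-relative baseline (mean reward) gives a response-level
    advantage; each incorrect segment then receives that advantage minus
    a fixed penalty, so flawed steps are pushed down even inside an
    otherwise-rewarded response.
    """
    baseline = sum(rewards) / len(rewards)
    per_segment = []
    for reward, flags in zip(rewards, segment_flags):
        adv = reward - baseline
        per_segment.append([adv if ok else adv - penalty for ok in flags])
    return per_segment
```

The point of the decomposition is exactly this asymmetry: a high-reward response with one bad step still sees a negative learning signal on that step, rather than having the error reinforced along with the rest of the chain.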
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ReRec, a reinforcement fine-tuning (RFT) framework for LLM-based recommendation assistants handling complex queries. It introduces three components: (1) Dual-Graph Enhanced Reward Shaping that integrates NDCG@K with Query Alignment and Preference Alignment scores, (2) Reasoning-aware Advantage Estimation that decomposes outputs into reasoning segments and applies per-step penalties, and (3) an Online Curriculum Scheduler that assesses query difficulty to organize training. The central claim is that these yield improved multi-step reasoning, outperformance over state-of-the-art baselines on recommendation tasks, and preservation of core LLM abilities such as instruction-following and general knowledge, with code released at https://github.com/jiani-huang/ReRec.
Significance. If the empirical results prove robust under proper controls, the work could advance reasoning-augmented LLM recommenders by providing a structured RFT approach for complex, multi-step queries. The public code release supports reproducibility, which is a clear strength.
Major comments (2)
- [Experiments] Experiments section: The central outperformance claim requires detailed reporting of datasets, baseline implementations, number of runs, statistical tests, and error bars to be evaluable; the abstract supplies none of these, and without them the evidence for genuine gains over baselines cannot be assessed.
- [§3] Reasoning-aware Advantage Estimation and Dual-Graph Enhanced Reward Shaping (described in §3): these tie rewards directly to NDCG@K and alignment metrics used in training; without ablations isolating each component or OOD query tests, it remains unclear whether improvements reflect robust multi-step reasoning or exploitation of the exact reward signals and curriculum schedule.
Minor comments (1)
- [Abstract] Abstract: 'dynamically assess' should be 'dynamically assesses' for subject-verb agreement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the paper.
Point-by-point responses
- Referee: [Experiments] Experiments section: The central outperformance claim requires detailed reporting of datasets, baseline implementations, number of runs, statistical tests, and error bars to be evaluable; the abstract supplies none of these, and without them the evidence for genuine gains over baselines cannot be assessed.
Authors: We agree that comprehensive experimental details are essential for assessing the robustness of the outperformance claims. Although the abstract is concise by design, the Experiments section in the current manuscript provides some dataset and baseline information. In the revised manuscript, we will expand this section to explicitly detail the datasets, baseline implementation specifics (including any adaptations from original papers), the number of independent runs, results of statistical significance tests (e.g., paired t-tests), and error bars on all reported metrics. These additions will enable direct evaluation of the gains over baselines. revision: yes
- Referee: [§3] Reasoning-aware Advantage Estimation and Dual-Graph Enhanced Reward Shaping (described in §3): these tie rewards directly to NDCG@K and alignment metrics used in training; without ablations isolating each component or OOD query tests, it remains unclear whether improvements reflect robust multi-step reasoning or exploitation of the exact reward signals and curriculum schedule.
Authors: We acknowledge the valid concern that tying rewards to NDCG@K and alignment metrics could risk exploitation rather than genuine reasoning improvements. The Reasoning-aware Advantage Estimation is intended to mitigate this by decomposing outputs and applying per-step penalties to encourage correct intermediate reasoning, while Dual-Graph Enhanced Reward Shaping provides complementary signals. To address this directly, we will add ablation studies that systematically isolate each component (Dual-Graph Reward Shaping, Reasoning-aware Advantage Estimation, and Online Curriculum Scheduler) and report their individual contributions. We will also include evaluations on out-of-distribution (OOD) queries to demonstrate generalization beyond the training reward signals and curriculum schedule. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper proposes an RFT framework whose three components (Dual-Graph Enhanced Reward Shaping using NDCG@K plus alignment scores, Reasoning-aware Advantage Estimation via output segmentation, and Online Curriculum Scheduler) are defined directly from standard recommendation metrics and output decomposition. No equations, self-citations, or uniqueness claims are present in the supplied text that would make any claimed prediction or improvement equivalent to the inputs by construction. The outperformance statement rests on experimental comparison to baselines, which constitutes independent empirical content rather than a fitted renaming or self-referential loop. This is the normal case of a method paper whose central claims remain falsifiable outside the training signals themselves.
Forward citations
Cited by 1 Pith paper
- SAGER: Self-Evolving User Policy Skills for Recommendation Agent. SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...