Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

arxiv: 2602.07774 · v5 · pith:3KVME2A2new · submitted 2026-02-08 · 💻 cs.IR · cs.AI

keywords: reasoning, reranking, llms, designed, fine-tuning, generative, high-quality, non-semantic

Abstract

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger, larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to hack the reward by preserving the original item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
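
The abstract reports gains in Recall@5 and NDCG@5 but does not spell out how they are computed. The following is a minimal Python sketch of the standard binary-relevance definitions, not the paper's evaluation code. The function names, the item identifiers, and the choice to normalize recall by the number of relevant items are assumptions made for illustration, and the ranked list is assumed to contain no duplicates.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k positions."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k: DCG of the list divided by the ideal DCG."""
    # A hit at 0-indexed position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical candidate list before and after reranking (illustrative IDs only).
candidates = ["item_a", "item_b", "item_c", "item_d", "item_e", "item_f"]
reranked = ["item_c", "item_a", "item_b", "item_d", "item_e", "item_f"]
clicked = {"item_c"}

print(recall_at_k(candidates, clicked), ndcg_at_k(candidates, clicked))
print(recall_at_k(reranked, clicked), ndcg_at_k(reranked, clicked))
```

In this toy case both orderings have the same Recall@5 because the clicked item is already in the top five, while NDCG@5 rises when the reranker moves it to the first position. That position sensitivity is why a reranker is usually judged on both metrics.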

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TubiFM: Unified Item, Carousel, and Search Ranking for Streaming Discovery
cs.IR · 2026-05 · unverdicted · novelty 6.0

A Llama-based model trained on serialized user stories unifies item, carousel, and search ranking and outperforms specialist baselines offline while improving some online metrics and reducing latency.

TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
cs.IR · 2026-05 · unverdicted · novelty 5.0

TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.