SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Chen Chen; Guanghui Min; Jundong Li; Liangjie Hong; Liang Wu; Yaochen Zhu; Zaiyi Zheng

arxiv: 2605.17648 · v1 · pith:TA6ZH6EBnew · submitted 2026-05-17 · 💻 cs.AI

Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

Pith reviewed 2026-05-20 12:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords generative recommendation · reinforcement learning · policy optimization · credit assignment · semantic identifiers · reasoning traces · step-aligned optimization

The pith

SAPO assigns separate group-relative advantages to each reasoning step to fix credit assignment in RL for generative recommendation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative recommendation generates semantic identifier sequences for items together with reasoning traces and trains them with reinforcement learning that uses only exact-match outcome rewards. This sparse signal cannot tell which specific reasoning step or token caused an error when the final prediction misses, so the policy receives noisy updates. SAPO computes a distinct advantage for every reasoning step—one thinking block paired with its SID token—and applies that advantage only inside the corresponding segment. Experiments across three real-world datasets show more stable training and higher recommendation metrics than prior generative baselines, with the biggest lifts precisely where feedback sparsity makes step-level assignment valuable. If the approach holds, it points toward RL objectives that respect the autoregressive decoder's own decomposition of structured outputs.
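To make the sparsity problem concrete, the following minimal sketch contrasts an exact-match outcome reward with a per-level match vector for one predicted SID. The token strings such as `<a_12>` and the function names are hypothetical placeholders, not the paper's vocabulary or code.

```python
def outcome_reward(pred, target):
    """Exact-match outcome reward: 1.0 only if every SID token matches."""
    return 1.0 if tuple(pred) == tuple(target) else 0.0


def per_level_matches(pred, target):
    """Per-level correctness: 1 where the SID token at that level matches, else 0."""
    return [int(p == t) for p, t in zip(pred, target)]


# Hypothetical 3-level SIDs: the first two levels are right, the last is wrong.
target = ("<a_12>", "<b_7>", "<c_3>")
pred = ("<a_12>", "<b_7>", "<c_9>")

reward = outcome_reward(pred, target)      # 0.0: the whole item counts as a miss
matches = per_level_matches(pred, target)  # [1, 1, 0]: only level 3 is wrong
```

The outcome reward is zero even though two of the three levels are correct, which is the situation where the two matched positions are penalized together with the mismatched one.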

Core claim

Generative recommendation encodes items as semantic identifiers (SIDs) that are short coarse-to-fine token sequences and augments next-item prediction with explicit reasoning traces. These traces are optimized by reinforcement learning that supplies only an exact-match outcome reward on the final generated SID. Because the reward reports only whether the entire item is correct, any mismatch penalizes correct token positions together with the erroneous one and leaves the model without a signal for which reasoning step caused the failure. SAPO replaces the single broadcast advantage with a separate group-relative advantage computed for each reasoning step and applies that advantage exclusively

What carries the argument

Step-Aligned Policy Optimization (SAPO) that computes a distinct group-relative advantage for each reasoning step (one thinking block paired with one SID token) and applies the advantage only to the tokens inside that step.
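The mechanism can be sketched in a few lines of plain Python, assuming each rollout already carries one scalar reward per reasoning step (for instance SID-token correctness). The population-standard-deviation normalization, the `eps` constant, and the `step_ids` token labelling are assumptions made for illustration; the paper's exact reward (including any format bonus) and loss are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-6):
    """Normalize a group of rewards to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def step_advantages(step_rewards, eps=1e-6):
    """step_rewards[i][k]: reward of rollout i at step k.

    Returns adv[i][k], normalized across rollouts separately for each step k.
    """
    num_steps = len(step_rewards[0])
    columns = [
        group_relative([row[k] for row in step_rewards], eps)
        for k in range(num_steps)
    ]
    return [[columns[k][i] for k in range(num_steps)]
            for i in range(len(step_rewards))]


def token_advantages(step_ids, adv_row):
    """Spread each step's advantage over that step's tokens only.

    step_ids[t] is the step index of token t, or -1 for tokens outside any step.
    """
    return [adv_row[s] if s >= 0 else 0.0 for s in step_ids]


# Four hypothetical rollouts, K = 3 steps, rewards = per-level SID-token matches.
rewards = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
adv = step_advantages(rewards)

# Rollout 1: tokens 0-2 belong to step 0, 3-4 to step 1, 5 to step 2, 6 is padding.
tok_adv = token_advantages([0, 0, 0, 1, 1, 2, -1], adv[1])
```

In this toy group, rollout 1 receives a positive advantage on its first step and a negative one on its last, so the update differs across segments of the same response.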

If this is right

Reinforcement-learning training for generative recommendation becomes more stable across runs.
Recommendation accuracy rises consistently over existing generative baselines on real-world datasets.
The largest improvements appear in regimes where exact-match feedback is sparse and step-level credit matters most.
RL objectives for structured generation should be designed to mirror the decoder's hierarchical decomposition of the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-step alignment could be tested in other chain-of-thought generation settings such as mathematical reasoning or code synthesis where sparse final-answer rewards are common.
Step alignment may lower the engineering cost of creating dense intermediate rewards by extracting more signal from outcome-only supervision.
If the decoder decomposition changes (for example with different SID vocabularies), the advantage grouping would need to be redefined accordingly.

Load-bearing premise

That the natural unit of credit assignment is a single reasoning step consisting of one thinking block and one SID token, and that a separate group-relative advantage applied only to that step correctly identifies causal contributions without introducing new optimization biases.

What would settle it

The claim would fail if re-running the same three-dataset experiments with identical baselines showed no gain in recommendation metrics such as recall or NDCG and no reduction in training variance when per-step advantages replace the standard outcome-reward signal.
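For readers who want to run such a check, here is a small sketch of Recall@K and NDCG@K for the next-item setting, assuming exactly one ground-truth item per test case (so the ideal DCG is 1). The item identifiers and cutoff values are illustrative, and the paper's own evaluation protocol may differ.

```python
import math


def recall_at_k(ranked, target, k):
    """1.0 if the single ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if it is in the top k."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical ranked list of item ids; the ground truth sits at rank 2.
ranked = ["i3", "i7", "i1"]
r1 = recall_at_k(ranked, "i7", 1)  # target is not in the top 1
r2 = recall_at_k(ranked, "i7", 2)  # target is in the top 2
n2 = ndcg_at_k(ranked, "i7", 2)    # 1 / log2(3)
```

Averaging these per-case values over the test set, and over several random seeds, gives the numbers and variance estimates that the comparison would rest on.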

Figures

Figures reproduced from arXiv: 2605.17648 by Chen Chen, Guanghui Min, Jundong Li, Liangjie Hong, Liang Wu, Yaochen Zhu, Zaiyi Zheng.

**Figure 2.** Overview of SAPO. Stage 1 aligns the language model to the SID vocabulary, Stage 2 activates level-aware reasoning, and Stage 3 applies step-aligned reinforcement learning. Under the released decoding layout, the policy generates K thinking blocks followed by the K SID tokens of the predicted item. SAPO assigns a reasoning-step match reward using SID-token correctness, attaches a small format bonus at the …

**Figure 3.** RL training dynamics on the Industrial-and-Scientific dataset. We compare SAPO and …

**Figure 4.** Ablation diagnostics: reward and gradient norm …

**Figure 5.** Case study on Industrial-and-Scientific. Both methods receive the same prompt, with SID tokens highlighted in red. Outcome GRPO follows the general user interest but misses the target, whereas SAPO identifies more discriminative evidence and predicts the correct SID. Section 5.5, Case Study: To complement the quantitative results, we provide qualitative examples showing how SAPO shapes reasoning behavior …

**Figure 6.** Prompt template used for Stage 3 RL training. Placeholders in …

**Figure 7.** A concrete prompt instance from Office-Products. The user's history is represented entirely as SID tuples, and the rollout generates K=3 <think>…</think> thinking blocks (one per codebook level, matching the decoding layout of Eq. (4)) followed by the K=3 SID tokens. In this example the rollout recovers the ground-truth target exactly (all three levels correct); see Section 5.5 and Appendix N for side-…

**Figure 8.** RL training dynamics on the Video-Games dataset. This figure follows the same diagnostic layout as …

**Figure 9.** RL training dynamics on the Office-Products dataset. Appendix M, Extended Training Dynamics: Figures 8 and 9 extend the main-text diagnostic from Industrial-and-Scientific to the remaining two categories. On Video-Games, GRPO exhibits a stronger instability pattern, with reward collapse or oscillation, response-length drift, rapid KL growth, and an unstable SID match rate, whereas SAPO keeps reward and SID-match dyn…

Original abstract

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, outcome-reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SAPO's per-step advantages end up identical within each trajectory under outcome rewards, so the method does not deliver the claimed step-specific credit assignment.

The main thing to know is that SAPO computes a separate group-relative advantage for each reasoning step but ends up applying the same advantage value to every step inside one response. With an outcome reward that only checks final exact-match on the SID, the reward signal is constant across all steps in a trajectory. Normalizing relative to other samples at that position therefore gives the identical normalized advantage at every position for the same sample. Applying that value locally does not isolate which thinking block or token caused the mismatch any better than broadcasting one advantage across the whole response.
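The letter's argument can be checked numerically with a short sketch. It assumes the situation the letter describes, namely that every step of a rollout receives the same trajectory-level outcome reward, and it uses a population-standard-deviation normalization as one common form of group-relative advantage; the group size, `K`, and reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-6):
    """Normalize a group of rewards to zero mean and (roughly) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


# Hypothetical group of five rollouts with exact-match outcome rewards.
outcome = [1.0, 0.0, 0.0, 1.0, 0.0]
K = 3  # number of reasoning steps per rollout

# Trajectory-level advantage: one value per rollout, broadcast to all tokens.
trajectory_adv = group_relative(outcome)

# "Per-step" advantage when each step simply inherits the outcome reward.
broadcast = [[r] * K for r in outcome]
per_step_adv = [group_relative([row[k] for row in broadcast]) for k in range(K)]

# Every step position sees the same reward set, so all columns coincide.
identical = all(per_step_adv[k] == trajectory_adv for k in range(K))
```

Under this assumption `identical` is true, which is the equivalence the letter points to. The result changes as soon as the per-step rewards differ within a rollout, for example when they are per-level SID-token matches.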

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SAPO (Step-Aligned Policy Optimization) for reasoning-based generative recommendation. Generative recommendation is framed as autoregressive generation of semantic identifiers (SIDs) augmented with reasoning traces. The core contribution is replacing trajectory-level advantage broadcasting with per-reasoning-step group-relative advantages, where each advantage is computed separately and applied only to the paired thinking block and SID token. This is motivated by the limitations of sparse exact-match outcome rewards in identifying which reasoning step caused a mismatch. Experiments across three real-world recommendation datasets report stabilized RL training and consistent gains over generative recommendation baselines, with larger improvements in settings where step-level credit assignment matters.

Significance. If the per-step advantage mechanism delivers non-equivalent credit assignment and the reported gains hold under rigorous controls, the work could meaningfully advance RL objectives for structured autoregressive generation in recommendation. The alignment of the optimization unit with the decoder's natural decomposition (reasoning step + SID token) is a conceptually clean idea, and the empirical focus on real datasets with sparse feedback provides a practical testbed. Credit is due for targeting a concrete pain point in outcome-reward RL for long structured outputs.

major comments (1)

[Abstract and method formulation] Abstract and method description: the central claim that SAPO 'computes a separate group-relative advantage for each reasoning step' and thereby 'identifies causal contributions at the reasoning-step level without introducing new optimization biases' is not supported under the stated outcome-reward regime. The verifiable reward is final exact-match on the generated SID, which is identical for every reasoning step inside one trajectory. When group-relative normalization is performed separately per step position across sampled responses, every step within the same trajectory receives the identical normalized advantage value (derived from the same set of trajectory rewards). This is mathematically equivalent to broadcasting a single trajectory-level advantage, directly contradicting the stated benefit of step-specific credit assignment.

minor comments (2)

The manuscript should include explicit details on experimental setup (number of samples per group for advantage estimation, exact baselines, statistical significance testing, and error bars) to allow verification of the reported improvements and stabilization claims.
Notation for 'reasoning step' (thinking block paired with one SID token) should be formalized with an equation or diagram early in the method section to avoid ambiguity when describing the per-step advantage application.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and insightful review of our manuscript. The major comment raises a substantive point about the mathematical properties of the proposed advantage computation under outcome rewards. We address it directly below.

Referee: [Abstract and method formulation] Abstract and method description: the central claim that SAPO 'computes a separate group-relative advantage for each reasoning step' and thereby 'identifies causal contributions at the reasoning-step level without introducing new optimization biases' is not supported under the stated outcome-reward regime. The verifiable reward is final exact-match on the generated SID, which is identical for every reasoning step inside one trajectory. When group-relative normalization is performed separately per step position across sampled responses, every step within the same trajectory receives the identical normalized advantage value (derived from the same set of trajectory rewards). This is mathematically equivalent to broadcasting a single trajectory-level advantage, directly contradicting the stated benefit of step-specific credit assignment.

Authors: We thank the referee for this precise analysis. We agree that, because the outcome reward is defined at the full-trajectory level (exact match on the final SID), the set of rewards available for normalization is identical across every reasoning-step position. Consequently, the normalized advantage assigned to every step within a given trajectory is the same value. This renders the per-step computation mathematically equivalent to trajectory-level advantage broadcasting; it does not differentiate credit among steps on the basis of their individual causal contributions to the final reward. We acknowledge that the original wording in the abstract and method sections overstated the degree of step-specific credit assignment. We will revise both sections to remove the phrasing that SAPO 'identifies causal contributions at the reasoning-step level' and to describe the method more accurately as localizing the application of a trajectory-level advantage to the tokens of the corresponding reasoning block and SID token. We will also update the discussion of motivation to reflect that the primary observed benefits are empirical (training stability and performance gains) rather than a theoretical resolution of intra-trajectory credit assignment.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper's central derivation introduces SAPO by identifying the reasoning step as the natural credit-assignment unit and instantiating per-step group-relative advantage computation applied only to the paired thinking block and SID token. This construction does not reduce by the paper's own description to a quantity fitted from prior outcome-reward objectives, nor does it rely on self-citations for load-bearing uniqueness theorems or ansatzes. The method is presented as a direct alignment of the RL objective with the decoder's autoregressive decomposition, retaining independent content beyond any prior baselines or standard trajectory-level advantages. No equations or claims in the provided text exhibit the specific reductions required for circularity flags under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that reasoning steps form the appropriate granularity for credit assignment and on standard RL concepts such as group-relative advantage; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: The natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token).
Explicitly identified in the abstract as the key insight motivating the method.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token (Section 4.3, Eq. 8-9)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. Onerec: Unifying retrieve and rank with generative recommender and iterative preference alignment.arXiv preprint arXiv:2502.18965, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Reasoning over semantic ids enhances generative recommenda- tion.arXiv preprint arXiv:2603.23183, 2026

Yingzhi He, Yan Sun, Junfei Tan, Yuxin Chen, Xiaoyu Kong, Chunxu Shen, Xiang Wang, An Zhang, and Tat-Seng Chua. Reasoning over semantic ids enhances generative recommenda- tion.arXiv preprint arXiv:2603.23183, 2026

work page arXiv 2026

[4] [4]

Session-based Recommendations with Recurrent Neural Networks

Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks.arXiv preprint arXiv:1511.06939, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[5] [5]

Self-attentive sequential recommendation

Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In2018 IEEE international conference on data mining (ICDM), pages 197–206. IEEE, 2018

work page 2018

[6] [6]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11523–11532, 2022

work page 2022

[8] [8]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023

[9] [9]

Rec-r1: Bridging generative large language models and user-centric recommendation systems via reinforcement learning.Transactions on Machine Learning Research, 2025

Jiacheng Lin, Tian Wang, and Kun Qian. Rec-r1: Bridging generative large language models and user-centric recommendation systems via reinforcement learning.Transactions on Machine Learning Research, 2025

work page 2025

[10] [10]

Onerec-think: In-text reasoning for generative recommendation.arXiv preprint arXiv:2510.11639, 2025

Zhanyu Liu, Shiyao Wang, Xingmei Wang, Rongzhou Zhang, Jiaxin Deng, Honghui Bao, Jinghao Zhang, Wuchao Li, Pengfei Zheng, Xiangyu Wu, et al. Onerec-think: In-text reasoning for generative recommendation.arXiv preprint arXiv:2510.11639, 2025

work page arXiv 2025

[11] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.

[12] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.

[13] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 188–197, 2019.

[14] Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Tran, Jonah Samost, et al. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36:10299–10315, 2023.

[15] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[16] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the 20th European Conference on Computer Systems, pages 1279–1297, 2025.

[17] Jiakai Tang, Sunhao Dai, Teng Shi, Jun Xu, Xu Chen, Wen Chen, Jian Wu, and Yuning Jiang. Think before recommend: Unleashing the latent reasoning power for sequential recommendation. arXiv preprint arXiv:2503.22675, 2025.

[18] Jiaxi Tang and Ke Wang. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 565–573, 2018.

[19] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

[20] Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. Learnable item tokenization for generative recommendation. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 2400–2409, 2024.

[21] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[22] Shuo Yang, Jiangxia Cao, Haipeng Li, Yuqi Mao, and Shuchao Pang. Reccot: Enhancing recommendation via chain-of-thought. arXiv preprint arXiv:2506.21032, 2025.

[23] Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, and Liqiang Nie. R2ec: Towards large recommender models with reasoning. In The 39th Annual Conference on Neural Information Processing Systems, 2025.

[24] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[25] Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, et al. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations. In International Conference on Machine Learning, pages 58484–58509. PMLR, 2024.

[26] Luankang Zhang, Yonghao Huang, Hang Lv, Mingjia Yin, Liangyue Li, Zulong Chen, Hao Wang, and Enhong Chen. Why thinking hurts? Diagnosing and rectifying the reasoning shift in foundation recommender models. arXiv preprint arXiv:2602.16587, 2026.

[27] Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. Adapting large language models by integrating collaborative semantics for recommendation. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 1435–1448. IEEE, 2024.

[28] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.

[29] Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Vito Ostuni, Jundong Li, and Nathan Kallus. Rank-GRPO: Training LLM-based conversational recommender systems with reinforcement learning. arXiv preprint arXiv:2510.20150, 2025.

Appendix

A Notation and Symbols

B ...