Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

Craig Boutilier; Eugene Ie; Heng-Tze Cheng; Jim McFadden; Jing Wang; Morgane Lustman; Paul Covington; Ritesh Agarwal; Rui Wu; Sanmit Narvekar

arxiv: 1905.12767 · v2 · pith:OC64QK3Gnew · submitted 2019-05-29 · 💻 cs.LG · cs.AI · cs.IR · stat.ML

Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, Craig Boutilier

classification: 💻 cs.LG · cs.AI · cs.IR · stat.ML

keywords: user · long-term · methods · recommendations · recommender · tractable · behavior · choice

Abstract

Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behavior. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items, which may have interacting effects on user choice, methods are required to deal with the combinatorics of the RL action space. In this work, we address the challenge of making slate-based recommendations to optimize long-term value using RL. Our contributions are three-fold. (i) We develop SLATEQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates. Under mild assumptions on user choice behavior, we show that the long-term value (LTV) of a slate can be decomposed into a tractable function of its component item-wise LTVs. (ii) We outline a methodology that leverages existing myopic learning-based recommenders to quickly develop a recommender that handles LTV. (iii) We demonstrate our methods in simulation, and validate the scalability of decomposed TD-learning using SLATEQ in live experiments on YouTube.
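
To make the decomposition idea in (i) concrete, here is a minimal Python sketch of one way a slate's LTV could be assembled from item-wise LTVs: a choice-probability-weighted sum, Q(s, A) = sum over i in A of P(i | s, A) * Qbar(s, i). Everything specific here is an assumption for illustration, not taken from the abstract: the multinomial-logit choice model, the "no click" option with zero value, the function names (`choice_probs`, `slate_q`, `best_slate`), and the brute-force slate search.

```python
import math
from itertools import combinations


def choice_probs(scores, null_score=0.0):
    """Assumed multinomial-logit choice model over the slate's items.

    A 'no click' option with score `null_score` absorbs the remaining
    probability mass, so the returned probabilities sum to less than 1.
    """
    weights = [math.exp(s) for s in scores]
    denom = sum(weights) + math.exp(null_score)
    return [w / denom for w in weights]


def slate_q(scores, item_q, slate, null_score=0.0):
    """Slate value as a choice-weighted sum of item-wise LTVs.

    scores[i] : hypothetical user-item affinity driving the choice model
    item_q[i] : hypothetical item-wise LTV estimate, Qbar(s, i)
    slate     : tuple of item indices shown to the user
    'No click' is assumed to contribute zero value in this sketch.
    """
    probs = choice_probs([scores[i] for i in slate], null_score)
    return sum(p * item_q[i] for p, i in zip(probs, slate))


def best_slate(scores, item_q, k, null_score=0.0):
    """Brute-force search over all k-item slates (tiny catalogs only)."""
    items = range(len(scores))
    return max(
        combinations(items, k),
        key=lambda a: slate_q(scores, item_q, a, null_score),
    )


if __name__ == "__main__":
    scores = [0.0, 0.0, 0.0]   # equal affinities, for illustration
    item_q = [1.0, 2.0, 3.0]   # made-up item-wise LTVs
    print(best_slate(scores, item_q, k=2))
```

The point of the sketch is that only per-item quantities are learned or stored; the value of any candidate slate is computed from them on demand. The exhaustive search in `best_slate` is there only to keep the example short and would not scale to a realistic catalog.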

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Insider Attacks in Multi-Agent LLM Consensus Systems
cs.MA · 2026-05 · unverdicted · novelty 5.0

A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

Time-Constrained Recommendations: Reinforcement Learning Strategies for E-Commerce
cs.LG · 2025-12 · unverdicted · novelty 4.0

Reinforcement learning policies for time-constrained slate recommendations improve engagement over contextual bandits in e-commerce settings.