Deep Reinforcement Learning for Personalized Search Story Recommendation

Dongwon Lee; Jason (Jiasheng) Zhang; Junming Yin; Linhong Zhu

arXiv: 1907.11754 · v1 · pith:RM3AJOQAnew · submitted 2019-07-26 · 💻 cs.LG · cs.IR · stat.ML


Pith reviewed 2026-05-24 15:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.IR · stat.ML

keywords deep reinforcement learning · personalized recommendation · search story · Markov decision process · imitation learning · e-commerce search · cross-channel effect


The pith

A Markov decision process with deep reinforcement learning models both immediate clicks and long-term effects of search story recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional supervised learning cannot handle the cross-channel effects of search story recommendations, which influence both immediate user actions and future search behavior. It formulates the task as a Markov decision process and introduces a deep reinforcement learning architecture trained with imitation learning and reinforcement learning to optimize for both short-term and long-term rewards. A sympathetic reader would care because search stories appear on major platforms and shape user intent across channels in e-commerce and news settings. The method is evaluated on real-world data from JD.com to show improved recommendation quality.

Core claim

By modeling personalized search story recommendation as a Markov decision process, a deep reinforcement learning architecture trained jointly by imitation learning and reinforcement learning captures the immediate and future values of each recommendation, addressing limitations of supervised methods that ignore sequential and cross-channel impacts.

What carries the argument

A deep reinforcement learning architecture inside a Markov decision process framework, trained by both imitation learning and reinforcement learning, to estimate combined immediate and future rewards of search story recommendations.
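The imitation-learning half of this training scheme can be illustrated with a minimal behavior-cloning sketch. It assumes a tabular policy over discrete states and actions, which is a simplification and not the neural architecture the paper describes; the function `clone_policy`, the smoothing constant, and the logged keyword/story pairs are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def clone_policy(
    logs: Iterable[Tuple[str, str]],
    actions: List[str],
    smoothing: float = 1.0,
) -> Dict[str, Dict[str, float]]:
    """Estimate pi(action | state) from logged (state, action) pairs.

    For a tabular policy, imitating the logging controller reduces to
    per-state action frequencies. Additive smoothing (an assumption of
    this sketch) keeps unseen actions at non-zero probability.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for state, action in logs:
        counts[state][action] += 1

    policy: Dict[str, Dict[str, float]] = {}
    for state, seen in counts.items():
        total = sum(seen.values()) + smoothing * len(actions)
        policy[state] = {a: (seen[a] + smoothing) / total for a in actions}
    return policy


# Hypothetical log: (search keyword, story shown by the previous controller).
logged = [
    ("dress", "story_a"),
    ("dress", "story_a"),
    ("dress", "story_b"),
    ("shoes", "story_c"),
]
stories = ["story_a", "story_b", "story_c"]
pi_hat = clone_policy(logged, stories)
```

A policy cloned this way would only serve as a starting point; the reinforcement learning stage is what would then adjust it toward long-term reward.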

If this is right

The model jointly optimizes immediate clicks and downstream effects on user search patterns.
Imitation learning from historical logs provides an effective starting policy before reinforcement learning refinement.
Recommendations are selected to maximize cumulative reward over sequences of user interactions.
The architecture directly incorporates the cross-channel influence that a search story exerts on organic results.
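The difference between optimizing immediate clicks and optimizing cumulative reward can be made concrete with a discounted return. The sketch below assumes the standard discounted-sum form with a discount factor gamma; the reward values are hypothetical, and `discounted_return` is an illustrative helper, not code from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards for one search episode (e.g. clicks per step).
episode_rewards = [0.0, 1.0, 0.0, 2.0]

# gamma = 0 keeps only the first step's reward (a myopic objective).
myopic = discounted_return(episode_rewards, gamma=0.0)

# gamma close to 1 also credits rewards that arrive later in the episode.
far_sighted = discounted_return(episode_rewards, gamma=0.9)
```

With these made-up numbers the myopic value is 0.0 while the far-sighted value is 2.358: a recommendation with no immediate click can still carry value through later steps, which is the effect a purely supervised click predictor would not see.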

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same MDP framing could apply to other recommendation settings where one action alters the user's future state, such as news or video feeds.
Stronger state representations that explicitly encode multi-session history might further improve the model's ability to handle non-Markovian effects.
Live A/B tests measuring changes in overall session length or repeat visit rates would provide a direct test of the long-term value modeling.

Load-bearing premise

User search and click behavior can be accurately represented as a Markov decision process in which the current state contains all information needed to predict future rewards without unmodeled history or external factors.
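Stated formally, the premise can be written as follows. The notation is assumed here for illustration (states $s_t$, actions $a_t$, rewards $r_t$) and is not quoted from the paper.

```latex
\begin{aligned}
\Pr(s_{t+1} \mid s_t, a_t) &= \Pr(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t), \\
\mathbb{E}[\, r_t \mid s_t, a_t \,] &= \mathbb{E}[\, r_t \mid s_1, a_1, \ldots, s_t, a_t \,].
\end{aligned}
```

If either equality fails, for example because earlier sessions influence the next query in ways the state does not encode, value estimates computed from $(s_t, a_t)$ alone would be biased.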

What would settle it

On the JD.com datasets, if a supervised learning baseline matches or exceeds the proposed model on long-term user engagement metrics such as sustained search frequency or cross-channel conversion rates, the advantage of the MDP and RL formulation would be called into question.

Figures

Figures reproduced from arXiv: 1907.11754 by Dongwon Lee, Jason (Jiasheng) Zhang, Junming Yin, Linhong Zhu.

**Figure 2.** The illustrative view of neural network dynamic function.

**Figure 3.** The architecture of implemented RNN dynamic model. Colors are used to distinguish different ...

**Figure 4.** Network structure of reinforcement learning controller (best viewed in color).

**Figure 5.** Histograms of (a) episode length and (b) story impression frequency. Both follow a power-law distribution.

**Figure 6.** CTR improvement versus different choices of evaluation horizon H. The shaded area shows the standard deviation over multiple experiments. Points with non-integer H are interpolated for better visualization.

Original abstract

In recent years, \emph{search story}, a combined display with other organic channels, has become a major source of user traffic on platforms such as e-commerce search platforms, news feed platforms and web and image search platforms. The recommended search story guides a user to identify her own preference and personal intent, which subsequently influences the user's real-time and long-term search behavior. %With such an increased importance of search stories, As search stories become increasingly important, in this work, we study the problem of personalized search story recommendation within a search engine, which aims to suggest a search story relevant to both a search keyword and an individual user's interest. To address the challenge of modeling both immediate and future values of recommended search stories (i.e., cross-channel effect), for which conventional supervised learning framework is not applicable, we resort to a Markov decision process and propose a deep reinforcement learning architecture trained by both imitation learning and reinforcement learning. We empirically demonstrate the effectiveness of our proposed approach through extensive experiments on real-world data sets from JD.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies standard DRL plus imitation learning to search story recs on e-commerce platforms but supplies no metrics or baselines to support the claims.


The core of this paper is a straightforward application of deep RL (with imitation learning) to personalized search story recommendation. It models the task as an MDP so the policy can balance immediate clicks against longer-term cross-channel effects that supervised methods ignore. That framing is reasonable for the setting and matches the practical problem on platforms like JD.com where stories influence both real-time and future search behavior. The approach itself draws on existing techniques rather than introducing new algorithms or derivations. Credit is due for identifying a concrete use case where delayed rewards matter and for trying to handle them jointly instead of optimizing only short-term metrics.

The abstract states that experiments on real data show effectiveness, but it reports none of the actual numbers, baselines, ablations, or implementation details. Without those, there is no way to check whether the MDP formulation delivers measurable gains or whether the learned policy is misspecified. The stress-test concern about the Markov property is fair to raise: if user state (profile plus current keyword) leaves out session history or external factors, the value estimates will be biased, and any claimed advantage over supervised learning would shrink. The paper does not appear to test or relax that assumption.

This work is aimed at applied recsys practitioners who already use RL pipelines and need to handle multi-channel traffic. A reader already familiar with DQN-style methods and imitation learning will see little new machinery. It deserves peer review only if the authors supply the missing experimental section with clear baselines and statistical tests; on the current abstract alone it would be a desk reject for lack of evidence.

Referee Report

2 major / 1 minor

Summary. The manuscript models personalized search story recommendation as a Markov decision process and introduces a deep reinforcement learning architecture trained jointly via imitation learning and reinforcement learning. The central claim is that this approach captures both immediate and long-term (cross-channel) value of recommendations, unlike conventional supervised learning, with effectiveness shown via experiments on real-world JD.com data.

Significance. If the empirical claims hold after addressing modeling assumptions, the work could contribute to recommendation systems by demonstrating how RL can optimize long-term user behavior in search platforms where recommendations influence future cross-channel activity. The use of real-world industrial data is a positive aspect for practical relevance.

major comments (2)

[Proposed approach / MDP formulation] The MDP formulation (described in the proposed approach) assumes the state—constructed from user profile and current keyword—contains all information needed to predict future rewards. No validation or discussion is provided that user search/click behavior satisfies the Markov property (e.g., independence from unmodeled session history or external factors), which is load-bearing for the claimed advantage of RL over supervised methods.
[Abstract / Experiments] Abstract and Experiments section: the claim of empirical effectiveness on real-world data supplies no metrics, baselines, ablation studies (e.g., imitation-only vs. full RL), or quantitative results, making it impossible to assess whether the data supports the central claim or to compare against supervised alternatives.

minor comments (1)

[Abstract] Abstract contains a commented-out sentence fragment (starting with '%With such an increased importance') and an awkward transition ('As search stories become increasingly important, in this work...').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and outline the revisions we will make.


Referee: [Proposed approach / MDP formulation] The MDP formulation (described in the proposed approach) assumes the state—constructed from user profile and current keyword—contains all information needed to predict future rewards. No validation or discussion is provided that user search/click behavior satisfies the Markov property (e.g., independence from unmodeled session history or external factors), which is load-bearing for the claimed advantage of RL over supervised methods.

Authors: We agree that the manuscript would benefit from an explicit discussion of the Markov assumption. The state is constructed from user profile and current keyword following common practice in industrial recommendation systems, but we will add a dedicated paragraph in the Proposed Approach section acknowledging the assumption, noting its limitations with respect to unmodeled session history, and explaining why the chosen features make the approximation reasonable for this application. revision: yes
Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of empirical effectiveness on real-world data supplies no metrics, baselines, ablation studies (e.g., imitation-only vs. full RL), or quantitative results, making it impossible to assess whether the data supports the central claim or to compare against supervised alternatives.

Authors: The abstract is written at a high level for brevity, but we accept that including key quantitative results would improve transparency. The Experiments section reports results on JD.com data with comparisons to supervised methods; however, we will revise the abstract to state the main performance metrics and expand the Experiments section to explicitly present ablation studies (imitation-only versus joint imitation+RL) and additional baselines so that the empirical claims can be directly evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation self-contained with no reductions by construction


The provided abstract and text describe an MDP formulation plus a DRL architecture trained via imitation learning and RL to model immediate and long-term values. No equations, fitted parameters, or self-citations are shown that reduce a claimed prediction or result to the inputs by definition. The central modeling choice (Markov property) is an explicit assumption rather than a derived claim that loops back on itself. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. This matches the default expectation of no significant circularity when the text contains no load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is limited to the abstract; the central modeling choice is the MDP formulation whose validity is taken as given.

axioms (1)

domain assumption: User search behavior can be modeled as a Markov decision process.
Explicitly stated when the authors say they resort to an MDP to capture immediate and future values.

pith-pipeline@v0.9.0 · 5716 in / 1170 out tokens · 39380 ms · 2026-05-24T15:36:12.084781+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "we resort to a Markov decision process and propose a deep reinforcement learning architecture trained by both imitation learning and reinforcement learning"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "the reward $r_t(s_t, a_t)$ can be quantified as the number of clicks, or the number of orders"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 11 internal anchors

[1] INTRODUCTION. Imagine that a customer visits a retail shop to purchase a dress which is to her liking. As the customer walks in, a business assistant is present to assist the customer by answering questions on fashion trend or suggesting related dresses. In online e-commerce applications, more business units are adding a component that plays a similar ...

[2] RELATED WORK. In this section, we briefly review two topics that are relevant to our work, namely reinforcement learning and recommendation/ranking. 2.1 Reinforcement Learning. In the general reinforcement learning framework, an agent sequentially interacts with the environment and learns to achieve the best return, which is in the form of accumulated i...

[3] PROBLEM DEFINITION. 3.1 Preliminary. For ease of presentation, we first introduce the list of notations and basic concepts used through the entire work. Specifically, we use lower case symbols u, q, d, p to represent a single user, query, story item, and an item from another channel (e.g., the product item), respectively. Upper case symbols U, Q, D, P a...

[4] DEEP REINFORCEMENT LEARNING FOR SEARCH STORY RECOMMENDATION. In this section, we give an overview of our deep reinforcement learning framework for personalized search story recommendation, DRESS. Given limited offline data, we propose to combine both model-based augmentation and imitation learning with the conventional reinforcement learning. Mod...

[5] THE NEURAL NETWORK DYNAMIC FUNCTION. 5.1 Illustrative Overview. As introduced earlier, we parameterize the dynamic model $M_\theta$ as a neural network function and thus $\theta$ represents the weights of neural networks. As illustrated in Figure 2, our dynamic model consists of two units: a reward model $M_R$ and a transition model $M_T$. The transition model $M_T$ updates the...

[6] CONTROLLER REINFORCEMENT LEARNING. Our reinforcement learning controller is designed under the traditional actor-critic architecture [3]. Specifically, the controller is a multi-head neural network, which is used as the function approximator for choosing the best story from the story embedding pool. Figure 4 illustrates our network structure of reinforcemen...

[7] IMITATION AND IMAGINATION. 7.1 Imitation Learning. In our search recommendation task, and most other real-world decision-making problems (e.g., finance and healthcare), we have access to the logging data of the system being operated by its previous controller, but we do not have access to an accurate simulator of the system. The goal of the imitation ...

[8] EXPERIMENTAL VALIDATION. In this section, we conduct extensive experiments with a dataset from a real e-commerce company and evaluate the effectiveness of DRESS. 8.1 Experimental Setup. 8.1.1 Dataset. We evaluate our methods on a dataset collected between Apr 2018 and Jul 2018 from JD.com [45]. We sampled all search sessions that are related to a category “...

[9] ORIGIN: This is the state-of-the-art implementation of a search story recommendation, which results in the offline data, currently being used by the company.

[10] DNNC (Deep Neural Network Classifier): Without considering the cross-channel effect, this method is ... [Figure 5: Histograms of (a) episode length (axes: Length of Search Episode, Number of Users) and (b) story impression frequency (axes: Number of Impressions in Sessions, Number of Stories). Both follow a power-law distribution.]

[11] DRESS-m: This is the myopic version of DRESS that only considers immediate short-term reward, which is implemented by setting γ = 0.

[12] DRESS-s: This is the simplified version of DRESS with the controller imagination module (Section 7.2) removed. 8.1.3 Evaluation Metric. The goal of a search story recommendation is to facilitate users during the search of products. Therefore, we use search-session-based user feedback on products as the main performance measure. In particular, we use the p...

[13] Log probability ratio: $\mathrm{ratio}_i = \log\left(\frac{\pi(a_i \mid s_i)}{b(a_i \mid s_i)}\right)$ for a session $i$.

[14] Total variation divergence: $D_{TV}(b \,\|\, \pi)_i = \frac{1}{2} \sum_{a'} \left| \pi(a' \mid s_i) - b(a' \mid s_i) \right|$ [28].

[15] KL-divergence: $D_{KL}(b \,\|\, \pi)_i = \sum_{a'} b(a' \mid s_i) \log\left(\frac{b(a' \mid s_i)}{\pi(a' \mid s_i)}\right)$. We calculate the averages of each difference measure over sessions in test data. We use the uniform distribution unif for comparison. Results are shown in Table 6. Compared with uniform policy unif, both DRESS and DRESS-s are close to the imitation policy. As expected, the policy obtained b...

[16] CONCLUSION. Deep reinforcement learning has been successfully used as a powerful method to capture a wide variety of non-trivial user behavior on online platforms (e.g., news feed recommendation, e-commerce search). In this work, following these successes, we applied the reinforcement learning framework to the challenging problem of cross-channel sear...

[17] N. Abe, N. Verma, C. Apte, and R. Schroko. Cross channel optimized marketing by reinforcement learning. In SIGKDD, pages 767–772. ACM, 2004.

[18] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.

[19] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.

[20] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 14(1):3207–3260, 2013.

[21] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In WSDM, pages 661–670. ACM, 2017.

[22] Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang. Reinforcement mechanism design for e-commerce. In WWW, pages 1339–1348, 2018.

[23] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In RecSys, pages 191–198. ACM, 2016.

[24] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In ICML, pages 2829–2838, 2016.

[25] R. Guerraoui, A.-M. Kermarrec, T. Lin, and R. Patra. Heterogeneous recommendations: what you might like to read after watching Interstellar. Proceedings of the VLDB Endowment, 10(10):1070–1081, 2017.

[26] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015.

[27] B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. 2016.

[28] Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. arXiv preprint arXiv:1803.00710, 2018.

[29] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pages 426–434. ACM, 2008.

[30] Y. Koren, R. Bell, C. Volinsky, et al. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.

[31] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.

[32] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.

[33] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010.

[34] T. Li, Z. Xu, J. Tang, and Y. Wang. Model-free control for distributed stream data processing using deep reinforcement learning. Proceedings of the VLDB Endowment, 11(6):705–718, 2018.

[35] Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

[36] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[37] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014.

[38] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pages 593–600. ACM, 2005.

[39] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.

[40] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[41] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, pages 784–791. ACM, 2008.

[42] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791–798. ACM, 2007.

[43] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295. ACM, 2001.

[44] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.

[45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[46] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[47] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.

[48] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning, pages 216–224. Elsevier, 1990.

[49] Y. K. Tan, X. Xu, and Y. Liu. Improved recurrent neural networks for session-based recommendations. arXiv preprint arXiv:1606.08117, 2016.

[50] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pages 1806–1812, 2015.

[51] I. Trummer, S. Moseley, D. Maram, S. Jo, and J. Antonakakis. SkinnerDB: regret-bounded query evaluation via reinforcement learning. Proceedings of the VLDB Endowment, 11(12):2074–2077, 2018.

[52] A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013.

[53] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.

[54] H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244. ACM, 2015.

[55] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[56] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pages 2746–2754, 2015.

[57] M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola. Maximum margin matrix factorization for collaborative ranking. NIPS, pages 1–8, 2007.

[58] J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang, T. Cheng, L. Liu, M. Ran, and Z. Li. An end-to-end automatic cloud database tuning system using deep reinforcement learning. SIGMOD, 2019.

[59] X. Zhao, W. Zhang, and J. Wang. Interactive collaborative filtering. In CIKM, pages 1411–1420. ACM, 2013.

[60] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li. DRN: A deep reinforcement learning framework for news recommendation. In WWW, pages 167–176, 2018.

[61] L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. Reinforcement learning to optimize long-term user engagement in recommender systems, 2019.
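The three policy-difference measures quoted in entries [13] to [15] above (log probability ratio, total variation divergence, KL-divergence) can be sketched for a single session as follows. This is a minimal illustration, assuming each policy is given as a dictionary of action probabilities for one fixed state and that both policies share the same support; the function names and the example distributions are hypothetical.

```python
import math
from typing import Dict

# action -> probability under one fixed state s_i
Policy = Dict[str, float]


def log_prob_ratio(pi: Policy, b: Policy, action: str) -> float:
    """log(pi(a_i | s_i) / b(a_i | s_i)) for the logged action a_i."""
    return math.log(pi[action] / b[action])


def total_variation(b: Policy, pi: Policy) -> float:
    """Half the L1 distance between the two action distributions."""
    return 0.5 * sum(abs(pi[a] - b[a]) for a in b)


def kl_divergence(b: Policy, pi: Policy) -> float:
    """sum_a b(a) * log(b(a) / pi(a)); terms with b(a) = 0 contribute 0."""
    return sum(p * math.log(p / pi[a]) for a, p in b.items() if p > 0.0)


# Hypothetical action distributions over three stories in one session.
b = {"story_a": 0.5, "story_b": 0.25, "story_c": 0.25}
pi = {"story_a": 0.25, "story_b": 0.5, "story_c": 0.25}

ratio = log_prob_ratio(pi, b, "story_a")
tv = total_variation(b, pi)
kl = kl_divergence(b, pi)
```

The quoted excerpt averages each measure over test sessions; doing so here would amount to calling these functions once per session and taking the mean.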

[1] [1]

Deep Reinforcement Learning for Personalized Search Story Recommendation

INTRODUCTION Imagine that a customer visits a retail shop to purchase a dress which is to her liking. As the customer walks in, a busi- ness assistant is present to assist the customer by answer- ing questions on fashion trend or suggesting related dresses. In online e-commerce applications, more business units are adding a component that plays a similar ...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[2] [2]

RELA TED WORK In this session, we brieﬂy review two topics that are rel- evant to our work, namely reinforcement learning and rec- ommendation/ranking. 2.1 Reinforcement Learning In the general reinforcement learning framework, an agent sequentially interacts with the environment and learns to achieve the best return, which is in the form of accumulated i...

work page

[3] [3]

Speciﬁcally, we use lower case symbols u, q, d, p to rep- resent a single user, query, story item, and an item from another channel (e.g., the product item), respectively

PROBLEM DEFINITION 3.1 Preliminary For ease of presentation, we ﬁrst introduce the list of no- tations and basic concepts used through the entire work. Speciﬁcally, we use lower case symbols u, q, d, p to rep- resent a single user, query, story item, and an item from another channel (e.g., the product item), respectively. Up- per case symbols U, Q, D, P a...

work page

[4] [4]

on-policy

DEEP REINFORCEMENT LEARNING FOR SEARCH STORY RECOMMENDATION In this section, we give an overview of our deep reinforcement learning framework for personalized search story recommendation, DRESS. Given limited offline data, we propose to combine both model-based augmentation and imitation learning with conventional reinforcement learning. Mod...

[5] [5]

THE NEURAL NETWORK DYNAMIC FUNCTION 5.1 Illustrative Overview As introduced earlier, we parameterize the dynamic model $M_\theta$ as a neural network function, and thus $\theta$ represents the weights of the neural network. As illustrated in Figure 2, our dynamic model consists of two units: a reward model $M_R$ and a transition model $M_T$. The transition model $M_T$ updates the...

[6] [6]

CONTROLLER REINFORCEMENT LEARNING Our reinforcement learning controller is designed under the traditional actor-critic architecture [3]. Specifically, the controller is a multi-head neural network, which is used as the function approximator for choosing the best story from the story embedding pool. Figure 4 illustrates our network structure of reinforcemen...

[7] [7]

The goal of the imitation learning is thus to learn to imitate the previous controller with a ﬁxed policy π0

IMITATION AND IMAGINATION 7.1 Imitation Learning In our search recommendation task, and in most other real-world decision-making problems (e.g., finance and healthcare), we have access to the logging data of the system being operated by its previous controller, but we do not have access to an accurate simulator of the system. The goal of the imitation ...
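As a rough illustration of imitating a previous controller from logging data alone, the sketch below estimates a fixed behavior policy from logged (state, action) pairs by simple frequency counts. This is only a tabular stand-in under stated assumptions: the function name `fit_behavior_policy`, the string-valued states and actions, and the toy log are hypothetical, and the paper itself uses a neural network controller rather than a count table.

```python
from collections import Counter, defaultdict


def fit_behavior_policy(logged_pairs):
    """Estimate pi_0(action | state) from logged (state, action) pairs.

    Hypothetical tabular estimator: each probability is the empirical
    frequency of the action among all logged visits to that state.
    """
    counts = defaultdict(Counter)
    for state, action in logged_pairs:
        counts[state][action] += 1

    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


# Toy log (made up for illustration): the state is a query, the action a story.
logs = [
    ("q_dress", "story_a"),
    ("q_dress", "story_a"),
    ("q_dress", "story_b"),
    ("q_shoes", "story_c"),
]
pi0_hat = fit_behavior_policy(logs)
```

A learned controller can then be trained to match `pi0_hat` before any reward-driven updates, which is the general idea of starting from the logged policy when no simulator is available.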

[8] [8]

women dress

EXPERIMENTAL VALIDATION In this section, we conduct extensive experiments with a dataset from a real e-commerce company and evaluate the effectiveness of DRESS. 8.1 Experimental Setup 8.1.1 Dataset We evaluate our methods on a dataset collected between Apr 2018 and Jul 2018 from JD.com [45]. We sampled all search sessions that are related to a category “...

[9] [9]

ORIGIN: This is the state-of-the-art implementation of search story recommendation currently used by the company; it is the policy that produced the offline data

[10] [10]

Both follow a power-law distribution

DNNC (Deep Neural Network Classifier): Without considering the cross-channel effect, this method is

Figure 5: Histograms of (a) episode length (axes: Number of Users, Length of Search Episode) and (b) story impression frequency (axes: Number of Stories, Number of Impressions in Sessions). Both fol...

[11] [11]

DRESS-m: This is the myopic version of DRESS that only considers immediate short-term reward, which is implemented by setting γ = 0
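To see why setting γ = 0 yields a myopic agent, the sketch below computes a discounted return for one episode. It assumes the standard discounted-sum objective, the sum over t of γ^t · r_t; the function name `discounted_return` and the reward values are hypothetical and not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Made-up per-step rewards for a three-step search episode.
rewards = [1.0, 0.0, 2.0]

# With gamma = 0 only the immediate reward survives (0.0 ** 0 == 1.0 in Python).
myopic = discounted_return(rewards, gamma=0.0)

# With gamma = 0.9 later rewards still contribute to the objective.
farsighted = discounted_return(rewards, gamma=0.9)
```

Under this assumption, `myopic` equals the first reward alone, so the controller has no incentive to account for the later, cross-channel effect of a recommendation.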

[12] [12]

DRESS-s: This is the simplified version of DRESS with the controller imagination module (Section 7.2) removed. 8.1.3 Evaluation Metric The goal of search story recommendation is to facilitate users during the search of products. Therefore, we use search-session-based user feedback on products as the main performance measure. In particular, we use the p...

[13] [13]

Log probability ratio: $\mathrm{ratio}_i = \log \frac{\pi(a_i \mid s_i)}{b(a_i \mid s_i)}$ for a session $i$

[14] [14]

Total variation divergence: $D_{TV}(b \| \pi)_i = \frac{1}{2} \sum_{a'} |\pi(a' \mid s_i) - b(a' \mid s_i)|$ [28]

[15] [15]

KL-divergence: $D_{KL}(b \| \pi)_i = \sum_{a'} b(a' \mid s_i) \log \frac{b(a' \mid s_i)}{\pi(a' \mid s_i)}$. We calculate the averages of each difference measure over sessions in the test data. We use the uniform distribution unif for comparison. Results are shown in Table 6. Compared with the uniform policy unif, both DRESS and DRESS-s are close to the imitation policy. As expected, the policy obtained b...
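The policy difference measures listed above (log probability ratio, total variation divergence, and KL-divergence) can be sketched for a single session as follows. This is a minimal illustration, assuming each policy is given as a dictionary mapping every candidate action to its probability in state s_i; the function names and the example distributions are hypothetical.

```python
import math


def log_prob_ratio(pi, b, action):
    """log(pi(a_i | s_i) / b(a_i | s_i)) for the logged action a_i."""
    return math.log(pi[action] / b[action])


def total_variation(b, pi):
    """0.5 * sum over actions of |pi(a | s_i) - b(a | s_i)|."""
    return 0.5 * sum(abs(pi[a] - b[a]) for a in b)


def kl_divergence(b, pi):
    """sum over actions of b(a | s_i) * log(b(a | s_i) / pi(a | s_i)).

    Terms with b(a | s_i) == 0 contribute zero and are skipped.
    """
    return sum(b[a] * math.log(b[a] / pi[a]) for a in b if b[a] > 0)


# Made-up action distributions over three stories for one session.
b = {"story_a": 0.5, "story_b": 0.25, "story_c": 0.25}
pi = {"story_a": 0.25, "story_b": 0.5, "story_c": 0.25}

ratio = log_prob_ratio(pi, b, "story_a")
tv = total_variation(b, pi)
kl = kl_divergence(b, pi)
```

Averaging each quantity over all test sessions gives the per-measure summaries described in the text.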

[16] [16]

CONCLUSION Deep reinforcement learning has been successfully used as a powerful method to capture a wide variety of non-trivial user behavior on online platforms (e.g., news feed recommendation, e-commerce search). In this work, following these successes, we applied the reinforcement learning framework to the challenging problem of cross-channel sear...

[17] [17]

N. Abe, N. Verma, C. Apte, and R. Schroko. Cross channel optimized marketing by reinforcement learning. In SIGKDD, pages 767–772. ACM, 2004

[18] [18]

K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017

[19] [19]

D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016

[20] [20]

L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 14(1):3207–3260, 2013

[21] [21]

H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In WSDM, pages 661–670. ACM, 2017

[22] [22]

Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang. Reinforcement mechanism design for e-commerce. In WWW, pages 1339–1348, 2018

[23] [23]

P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In RecSys, pages 191–198. ACM, 2016

[24] [24]

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In ICML, pages 2829–2838, 2016

[25] [25]

R. Guerraoui, A.-M. Kermarrec, T. Lin, and R. Patra. Heterogeneous recommendations: what you might like to read after watching interstellar. Proceedings of the VLDB Endowment, 10(10):1070–1081, 2017

[26] [26]

B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939, 2015

[27] [27]

B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk. Parallel recurrent neural network architectures for feature-rich session-based recommendations. 2016

[28] [28]

Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. arXiv preprint arXiv:1803.00710, 2018

[29] [29]

Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pages 426–434. ACM, 2008

[30] [30]

Y. Koren, R. Bell, C. Volinsky, et al. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009

[31] [31]

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016

[32] [32]

J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016

[33] [33]

L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010

[34] [34]

T. Li, Z. Xu, J. Tang, and Y. Wang. Model-free control for distributed stream data processing using deep reinforcement learning. Proceedings of the VLDB Endowment, 11(6):705–718, 2018.

[35] [35]

Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017

[36] [36]

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

[37] [37]

T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014

[38] [38]

J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pages 593–600. ACM, 2005

[39] [39]

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016

[40] [40]

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015

[41] [41]

F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, pages 784–791. ACM, 2008

[42] [42]

R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative filtering. In ICML, pages 791–798. ACM, 2007

[43] [43]

B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295. ACM, 2001

[44] [44]

J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015

[45] [45]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[46] [46]

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016

[47] [47]

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014

[48] [48]

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning, pages 216–224. Elsevier, 1990

[49] [49]

Y. K. Tan, X. Xu, and Y. Liu. Improved recurrent neural networks for session-based recommendations. arXiv preprint arXiv:1606.08117, 2016

[50] [50]

G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pages 1806–1812, 2015

[51] [51]

I. Trummer, S. Moseley, D. Maram, S. Jo, and J. Antonakakis. Skinnerdb: regret-bounded query evaluation via reinforcement learning. Proceedings of the VLDB Endowment, 11(12):2074–2077, 2018

[52] [52]

A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013

[53] [53]

H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016

[54] [54]

H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244. ACM, 2015

[55] [55]

Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015

[56] [56]

M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pages 2746–2754, 2015

[57] [57]

M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola. Maximum margin matrix factorization for collaborative ranking. NIPS, pages 1–8, 2007

[58] [58]

J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang, T. Cheng, L. Liu, M. Ran, and Z. Li. An end-to-end automatic cloud database tuning system using deep reinforcement learning. SIGMOD, 2019

[59] [59]

X. Zhao, W. Zhang, and J. Wang. Interactive collaborative filtering. In CIKM, pages 1411–1420. ACM, 2013

[60] [60]

G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li. Drn: A deep reinforcement learning framework for news recommendation. In WWW, pages 167–176, 2018

[61] [61]

L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. Reinforcement learning to optimize long-term user engagement in recommender systems, 2019.