Deep Reinforcement Learning for Personalized Search Story Recommendation
Pith reviewed 2026-05-24 15:36 UTC · model grok-4.3
The pith
A Markov decision process with deep reinforcement learning models both immediate clicks and long-term effects of search story recommendations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling personalized search story recommendation as a Markov decision process, a deep reinforcement learning architecture trained jointly by imitation learning and reinforcement learning captures the immediate and future values of each recommendation, addressing limitations of supervised methods that ignore sequential and cross-channel impacts.
What carries the argument
A deep reinforcement learning architecture inside a Markov decision process framework, trained by both imitation learning and reinforcement learning, to estimate combined immediate and future rewards of search story recommendations.
If this is right
- The model jointly optimizes immediate clicks and downstream effects on user search patterns.
- Imitation learning from historical logs provides an effective starting policy before reinforcement learning refinement.
- Recommendations are selected to maximize cumulative reward over sequences of user interactions.
- The architecture directly incorporates the cross-channel influence that a search story exerts on organic results.
Where Pith is reading between the lines
- The same MDP framing could apply to other recommendation settings where one action alters the user's future state, such as news or video feeds.
- Stronger state representations that explicitly encode multi-session history might further improve the model's ability to handle non-Markovian effects.
- Live A/B tests measuring changes in overall session length or repeat visit rates would provide a direct test of the long-term value modeling.
Load-bearing premise
User search and click behavior can be accurately represented as a Markov decision process in which the current state contains all information needed to predict future rewards without unmodeled history or external factors.
What would settle it
On the JD.com datasets, if a supervised learning baseline matches or exceeds the proposed model on long-term user engagement metrics such as sustained search frequency or cross-channel conversion rates, the advantage of the MDP and RL formulation would be called into question.
Figures
read the original abstract
In recent years, \emph{search story}, a combined display with other organic channels, has become a major source of user traffic on platforms such as e-commerce search platforms, news feed platforms and web and image search platforms. The recommended search story guides a user to identify her own preference and personal intent, which subsequently influences the user's real-time and long-term search behavior. %With such an increased importance of search stories, As search stories become increasingly important, in this work, we study the problem of personalized search story recommendation within a search engine, which aims to suggest a search story relevant to both a search keyword and an individual user's interest. To address the challenge of modeling both immediate and future values of recommended search stories (i.e., cross-channel effect), for which conventional supervised learning framework is not applicable, we resort to a Markov decision process and propose a deep reinforcement learning architecture trained by both imitation learning and reinforcement learning. We empirically demonstrate the effectiveness of our proposed approach through extensive experiments on real-world data sets from JD.com.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript models personalized search story recommendation as a Markov decision process and introduces a deep reinforcement learning architecture trained jointly via imitation learning and reinforcement learning. The central claim is that this approach captures both immediate and long-term (cross-channel) value of recommendations, unlike conventional supervised learning, with effectiveness shown via experiments on real-world JD.com data.
Significance. If the empirical claims hold after addressing modeling assumptions, the work could contribute to recommendation systems by demonstrating how RL can optimize long-term user behavior in search platforms where recommendations influence future cross-channel activity. The use of real-world industrial data is a positive aspect for practical relevance.
major comments (2)
- [Proposed approach / MDP formulation] The MDP formulation (described in the proposed approach) assumes the state—constructed from user profile and current keyword—contains all information needed to predict future rewards. No validation or discussion is provided that user search/click behavior satisfies the Markov property (e.g., independence from unmodeled session history or external factors), which is load-bearing for the claimed advantage of RL over supervised methods.
- [Abstract / Experiments] Abstract and Experiments section: the claim of empirical effectiveness on real-world data supplies no metrics, baselines, ablation studies (e.g., imitation-only vs. full RL), or quantitative results, making it impossible to assess whether the data supports the central claim or to compare against supervised alternatives.
minor comments (1)
- [Abstract] Abstract contains a commented-out sentence fragment (starting with '%With such an increased importance') and an awkward transition ('As search stories become increasingly important, in this work...').
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and outline the revisions we will make.
read point-by-point responses
-
Referee: [Proposed approach / MDP formulation] The MDP formulation (described in the proposed approach) assumes the state—constructed from user profile and current keyword—contains all information needed to predict future rewards. No validation or discussion is provided that user search/click behavior satisfies the Markov property (e.g., independence from unmodeled session history or external factors), which is load-bearing for the claimed advantage of RL over supervised methods.
Authors: We agree that the manuscript would benefit from an explicit discussion of the Markov assumption. The state is constructed from user profile and current keyword following common practice in industrial recommendation systems, but we will add a dedicated paragraph in the Proposed Approach section acknowledging the assumption, noting its limitations with respect to unmodeled session history, and explaining why the chosen features make the approximation reasonable for this application. revision: yes
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of empirical effectiveness on real-world data supplies no metrics, baselines, ablation studies (e.g., imitation-only vs. full RL), or quantitative results, making it impossible to assess whether the data supports the central claim or to compare against supervised alternatives.
Authors: The abstract is written at a high level for brevity, but we accept that including key quantitative results would improve transparency. The Experiments section reports results on JD.com data with comparisons to supervised methods; however, we will revise the abstract to state the main performance metrics and expand the Experiments section to explicitly present ablation studies (imitation-only versus joint imitation+RL) and additional baselines so that the empirical claims can be directly evaluated. revision: yes
Circularity Check
No circularity; derivation self-contained with no reductions by construction
full rationale
The provided abstract and text describe an MDP formulation plus a DRL architecture trained via imitation learning and RL to model immediate and long-term values. No equations, fitted parameters, or self-citations are shown that reduce a claimed prediction or result to the inputs by definition. The central modeling choice (Markov property) is an explicit assumption rather than a derived claim that loops back on itself. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. This matches the default expectation of no significant circularity when the text contains no load-bearing reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption User search behavior can be modeled as a Markov decision process
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we resort to a Markov decision process and propose a deep reinforcement learning architecture trained by both imitation learning and reinforcement learning
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the reward rt(st,at) can be quantified as the number of clicks, or the number of orders
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Deep Reinforcement Learning for Personalized Search Story Recommendation
INTRODUCTION Imagine that a customer visits a retail shop to purchase a dress which is to her liking. As the customer walks in, a busi- ness assistant is present to assist the customer by answer- ing questions on fashion trend or suggesting related dresses. In online e-commerce applications, more business units are adding a component that plays a similar ...
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[2]
RELA TED WORK In this session, we briefly review two topics that are rel- evant to our work, namely reinforcement learning and rec- ommendation/ranking. 2.1 Reinforcement Learning In the general reinforcement learning framework, an agent sequentially interacts with the environment and learns to achieve the best return, which is in the form of accumulated i...
-
[3]
PROBLEM DEFINITION 3.1 Preliminary For ease of presentation, we first introduce the list of no- tations and basic concepts used through the entire work. Specifically, we use lower case symbols u, q, d, p to rep- resent a single user, query, story item, and an item from another channel (e.g., the product item), respectively. Up- per case symbols U, Q, D, P a...
-
[4]
DEEP REINFORCEMENT LEARNING FOR SEARCH STORY RECOMMENDA- TION In this section, we give an overview of our deep reinforce- ment learning framework for personalized search story rec- ommendation, DRESS. Given limited offline data, we pro- pose to combine both model-based augmentation and imi- tation learning with the conventional reinforcement learn- ing. Mod...
-
[5]
THE NEURAL NETWORK DYNAMIC FUNCTION 5.1 Illustrative Overview As introduced earlier, we parameterize the dynamic model Mθ as a neural network function and thus θ represents the weights of neural networks. As illustrated in Figure 2, our dynamic model consists of two units: a reward model MR and a transition model MT . The transition model MT up- dates the...
-
[6]
CONTROLLER REINFORCEMENT LEARNING Our reinforcement learning controller is designed under the traditional actor-critic architecture [3]. Specifically, the controller is a multi-head neural network, which is used as the function approximator for choosing the best story from the story embedding pool. Figure 4 illustrates our network structure of reinforcemen...
-
[7]
IMITA TION AND IMAGINA TION 7.1 Imitation Learning In our search recommendation task, and most other real- world decision-making problems (e.g., finance and health- care), we have access to the logging data of the system being operated by its previous controller, but we do not have ac- cess to an accurate simulator of the system. The goal of the imitation ...
-
[8]
EXPERIMENTAL V ALIDA TION In this section, we conduct extensive experiments with a dataset from a real e-commerce company and evaluate the effectiveness of DRESS. 8.1 Experimental Setup 8.1.1 Dataset We evaluate our methods on a dataset collected between Apr 2018 and Jul 2018 from JD.com [45]. We sampled all search sessions that are related to a category “...
work page 2018
-
[9]
ORIGIN: This is the state-of-the-art implementation of a search story recommendation, that results in the offline data, currently being used by the company
-
[10]
Both follow a power-law distribution
DNNC (Deep Neural Network Classifier): Without considering the cross-channel effect, this method is 7 10 102 103 104 10 50 100 150 200 Number of Users Length of Search Episode (a) 1 50 100 150 200 10 102 103 104 105 Number of Stories Number of Impressions in Sessions (b) Figure 5: Histograms of (a) episode length and (b) story impression frequency. Both fol...
-
[11]
DRESS-m: This is the myopic version of DRESS that only considers immediate short-term reward, which is implemented by setting γ = 0
-
[12]
DRESS-s: This is the simplified version of DRESS with the controller imagination module (Section.7.2) removed. 8.1.3 Evaluation Metric The goal of a search story recommendation is to facili- tate users during the search of products. Therefore, we use search session based user feedback on products as the main performance measure. In particular, we use the p...
-
[13]
Log probability ratio: rationi = log(π(ai|si) b(ai|si) ) for a ses- sion i
-
[14]
Total variation divergence: DTV(b||π)i = 1 2 ∑ a′|π(a′|si)− b(a′|si)| [28]
-
[15]
We calculate the averages of each difference measure over sessions in test data
KL-divergence: DKL(b||π)i = ∑ a′b(a′|si) log(b(a′|si) π(a′|si)). We calculate the averages of each difference measure over sessions in test data. We use the uniform distribution unif for comparison. Results are shown in Table.6. Compared with uniform policy unif, both DRESS and DRESS-s are close to the imitation policy. As expected, the policy ob- tained b...
-
[16]
CONCLUSION Deep reinforcement learning has been successfully used as a powerful method to capture a wide variety of non- trivial user behavior on online platforms (e.g., news feed recommendation, e-commerce search). In this work, fol- lowing these successes, we applied the reinforcement learn- ing framework to the challenging problem of cross-channel sear...
-
[17]
N. Abe, N. Verma, C. Apte, and R. Schroko. Cross channel optimized marketing by reinforcement learning. In SIGKDD, pages 767–772. ACM, 2004
work page 2004
-
[18]
A Brief Survey of Deep Reinforcement Learning
K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[19]
An Actor-Critic Algorithm for Sequence Prediction
D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [20]
-
[21]
H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Real-time bidding by reinforcement learning in display advertising. In WSDM, pages 661–670. ACM, 2017
work page 2017
-
[22]
Q. Cai, A. Filos-Ratsikas, P. Tang, and Y. Zhang. Reinforcement mechanism design for e-commerce. In WWW, pages 1339–1348, 2018
work page 2018
-
[23]
P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Recommender System, pages 191–198. ACM, 2016
work page 2016
-
[24]
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-learning with model-based acceleration. In ICML, pages 2829–2838, 2016
work page 2016
-
[25]
R. Guerraoui, A.-M. Kermarrec, T. Lin, and R. Patra. Heterogeneous recommendations: what you might like to read after watching interstellar. Proceedings of the VLDB Endowment, 10(10):1070–1081, 2017
work page 2017
-
[26]
Session-based Recommendations with Recurrent Neural Networks
B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
- [27]
-
[28]
Y. Hu, Q. Da, A. Zeng, Y. Yu, and Y. Xu. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. arXiv preprint arXiv:1803.00710 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[29]
Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In SIGKDD, pages 426–434. ACM, 2008
work page 2008
- [30]
- [31]
-
[32]
J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[33]
L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010
work page 2010
-
[34]
T. Li, Z. Xu, J. Tang, and Y. Wang. Model-free control for distributed stream data processing using deep reinforcement learning. Proceedings of the VLDB Endowment, 11(6):705–718, 2018. 10
work page 2018
-
[35]
Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[37]
T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, pages 1077–1084, 2014
work page 2014
-
[38]
J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, pages 593–600. ACM, 2005
work page 2005
-
[39]
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016
work page 1928
-
[40]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015
work page 2015
-
[41]
F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, pages 784–791. ACM, 2008
work page 2008
-
[42]
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative filtering. In ICML, pages 791–798. ACM, 2007
work page 2007
- [43]
-
[44]
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015
work page 2015
-
[45]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [46]
- [47]
-
[48]
R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning, pages 216–224. Elsevier, 1990
work page 1990
-
[49]
Y. K. Tan, X. Xu, and Y. Liu. Improved recurrent neural networks for session-based recommendations. arXiv preprint arXiv:1606.08117 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[50]
G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In IJCAI, pages 1806–1812, 2015
work page 2015
-
[51]
I. Trummer, S. Moseley, D. Maram, S. Jo, and J. Antonakakis. Skinnerdb: regret-bounded query evaluation via reinforcement learning. Proceedings of the VLDB Endowment , 11(12):2074–2077, 2018
work page 2074
-
[52]
A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, pages 2643–2651, 2013
work page 2013
-
[53]
H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016
work page 2016
-
[54]
H. Wang, N. Wang, and D.-Y. Yeung. Collaborative deep learning for recommender systems. In SIGKDD, pages 1235–1244. ACM, 2015
work page 2015
-
[55]
Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
- [56]
- [57]
- [58]
-
[59]
X. Zhao, W. Zhang, and J. Wang. Interactive collaborative filtering. In CIKM, pages 1411–1420. ACM, 2013
work page 2013
- [60]
-
[61]
L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin. Reinforcement learning to optimize long-term user engagement in recommender systems, 2019. 11
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.