From Past To Path: Masked History Learning for Next-Item Prediction in Generative Recommendation
Pith reviewed 2026-05-18 13:11 UTC · model grok-4.3
The pith
Reconstructing masked items from user history improves next-item prediction in generative recommendation systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked History Learning augments standard autoregressive training with an auxiliary masked history reconstruction objective. This compels the model to understand why an item path forms from past behaviors. The framework adds an entropy-guided masking policy to target informative historical items and a curriculum learning scheduler that transitions from history reconstruction to future prediction. On three public datasets, the paper reports that the resulting models outperform state-of-the-art generative recommenders.
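The review does not reproduce the paper's equations, so the following is only a sketch of how a combined objective of this kind is commonly written. The symbols are assumed notation, not the paper's: x_1, ..., x_n is the user history, M is the set of masked positions, x_{\setminus M} is the history with those positions hidden, and \alpha(t) is a curriculum weight that is nondecreasing in the training step t.

```latex
\mathcal{L}(\theta; t)
  = \alpha(t)\,\mathcal{L}_{\mathrm{AR}}(\theta)
  + \bigl(1 - \alpha(t)\bigr)\,\mathcal{L}_{\mathrm{MHL}}(\theta),
\qquad
\mathcal{L}_{\mathrm{AR}}(\theta) = -\log p_\theta\!\left(x_{n+1} \mid x_1, \dots, x_n\right),
\qquad
\mathcal{L}_{\mathrm{MHL}}(\theta) = -\sum_{i \in M} \log p_\theta\!\left(x_i \mid x_{\setminus M}\right)
```

Under this assumed form, a small \alpha(t) early in training emphasizes history reconstruction, and \alpha(t) approaching 1 recovers plain autoregressive next-item training.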
What carries the argument
Masked History Learning (MHL), an auxiliary reconstruction task that compels the model to recover masked items from a user's interaction history.
If this is right
- Models trained with the auxiliary reconstruction task achieve higher next-item accuracy than standard autoregressive generative recommenders.
- Entropy-guided masking selects the most informative historical items for reconstruction rather than random or uniform masking.
- The curriculum scheduler gradually reduces emphasis on history reconstruction in favor of future prediction during training.
- The combined framework yields measurable gains on three public datasets compared with existing generative recommendation methods.
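The entropy-guided masking and curriculum points in the list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to treat higher entropy as "more informative", the top-k selection rule, and the linear decay schedule are all assumptions made here for illustration.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy (in nats) of one predictive distribution over candidate items."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def masking_probabilities(position_dists: Sequence[Sequence[float]]) -> List[float]:
    """Normalize per-position entropies into a distribution over history positions.

    Assumption: a higher-entropy position is a more informative masking target.
    """
    entropies = [shannon_entropy(d) for d in position_dists]
    if not entropies:
        return []
    total = sum(entropies)
    if total == 0.0:
        # Every position is fully predictable: fall back to uniform masking.
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]


def select_mask(position_dists: Sequence[Sequence[float]], num_masked: int) -> List[int]:
    """Pick the num_masked positions with the highest masking probability."""
    probs = masking_probabilities(position_dists)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:num_masked])


def reconstruction_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear curriculum: reconstruction weight decays from 1 to 0."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress
```

A stochastic variant would sample positions from `masking_probabilities` instead of taking the top-k; which of the two the paper uses is not stated in this review.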
Where Pith is reading between the lines
- The same auxiliary reconstruction principle could be tested in other sequential generative tasks where past context is rich but future labels are sparse.
- If the entropy-guided policy proves critical, simpler masking strategies might underperform when user histories contain clear preference shifts.
- Curriculum scheduling may help stabilize training when the auxiliary task initially competes with the main prediction objective.
Load-bearing premise
The assumption that forcing reconstruction of masked historical items will make the model learn underlying user intent instead of surface-level next-item patterns.
What would settle it
A controlled ablation that removes the masked reconstruction task entirely while keeping all other components fixed, then measures whether next-item accuracy on the three public datasets drops to the level of prior generative models.
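One way to lay out such an ablation is as a small grid of training configurations. The sketch below is hypothetical: the configuration keys and the "random" masking alternative are illustrative names chosen here, not options taken from the paper's code.

```python
from itertools import product
from typing import Dict, List, Optional, Union

Config = Dict[str, Union[bool, Optional[str]]]


def ablation_grid() -> List[Config]:
    """Enumerate a plain autoregressive baseline plus every reconstruction variant."""
    # Baseline: no reconstruction task, so masking policy and curriculum do not apply.
    configs: List[Config] = [
        {"reconstruction": False, "masking": None, "curriculum": False}
    ]
    # Variants: reconstruction on, crossed with masking policy and curriculum.
    for masking, curriculum in product(["entropy", "random"], [True, False]):
        configs.append(
            {"reconstruction": True, "masking": masking, "curriculum": curriculum}
        )
    return configs
```

Comparing the baseline against the four reconstruction variants separates the effect of the auxiliary task itself from the effects of the masking policy and the scheduler.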
Original abstract
Generative recommendation, which directly generates item identifiers, has emerged as a promising paradigm for recommendation systems. However, its potential is fundamentally constrained by the reliance on purely autoregressive training. This approach focuses solely on predicting the next item while ignoring the rich internal structure of a user's interaction history, thus failing to grasp the underlying intent. To address this limitation, we propose Masked History Learning (MHL), a novel training framework that shifts the objective from simple next-step prediction to deep comprehension of history. MHL augments the standard autoregressive objective with an auxiliary task of reconstructing masked historical items, compelling the model to understand "why" an item path is formed from the user's past behaviors, rather than just "what" item comes next. We introduce two key contributions to enhance this framework: (1) an entropy-guided masking policy that intelligently targets the most informative historical items for reconstruction, and (2) a curriculum learning scheduler that progressively transitions from history reconstruction to future prediction. Experiments on three public datasets show that our method significantly outperforms state-of-the-art generative models, highlighting that a comprehensive understanding of the past is crucial for accurately predicting a user's future path.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Masked History Learning (MHL) to address limitations in generative recommendation systems that rely solely on autoregressive next-item prediction. MHL augments the standard objective with an auxiliary masked history reconstruction task, using an entropy-guided masking policy to select informative items and a curriculum learning scheduler that shifts focus from history reconstruction to future prediction. The authors claim this forces models to grasp underlying user intent from past behaviors, leading to better next-item prediction. Experiments on three public datasets reportedly show significant outperformance over state-of-the-art generative models.
Significance. If the performance gains are robust and the mechanism is substantiated, the work could meaningfully advance generative recommendation by showing that explicit history comprehension improves path prediction. The combination of auxiliary reconstruction, entropy-based selection, and curriculum scheduling offers a concrete training framework that could be adopted or extended in sequential recommendation models.
major comments (3)
- [Abstract and §4] Abstract and §4 Experiments: The central claim that the auxiliary masked-history reconstruction task compels the model to understand 'why' an item path forms (rather than surface-level next-item patterns) is load-bearing for the contribution, yet the reported results consist only of next-item accuracy metrics on three datasets with no ablations isolating the reconstruction objective from the entropy-guided masking policy or the curriculum scheduler, and no representation probing or qualitative path analysis to verify intent learning.
- [§3] §3 Method: The entropy-guided masking policy is presented as intelligently targeting the most informative historical items, but without a formal definition or derivation showing how entropy is computed over the history (e.g., no equation for per-item entropy or masking probability), it is unclear whether this policy is parameter-free or introduces additional hyperparameters that could explain performance differences.
- [§4] §4 Experiments: The outperformance is asserted without reported statistical significance tests, standard deviations across multiple runs, or comparisons against non-generative sequential baselines that also use masking or auxiliary objectives, making it difficult to attribute gains specifically to the proposed intent-comprehension mechanism versus generic benefits of multi-task training.
minor comments (2)
- [Abstract] The abstract states 'significantly outperforms' without naming the exact metrics (e.g., HR@10, NDCG@10) or the three datasets; these details should be added for immediate clarity.
- [§3] Notation for the combined loss (autoregressive + reconstruction) is not introduced until the method section; an early equation defining the total objective would improve readability.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below.
Referee: [Abstract and §4] Abstract and §4 Experiments: The central claim that the auxiliary masked-history reconstruction task compels the model to understand 'why' an item path forms (rather than surface-level next-item patterns) is load-bearing for the contribution, yet the reported results consist only of next-item accuracy metrics on three datasets with no ablations isolating the reconstruction objective from the entropy-guided masking policy or the curriculum scheduler, and no representation probing or qualitative path analysis to verify intent learning.
Authors: We acknowledge the importance of substantiating the central claim with more detailed analysis. In the revised version, we have added ablation studies that evaluate the model with and without the entropy-guided masking and the curriculum learning components separately. These results show that both elements contribute to the performance gains. We have also included a qualitative analysis section with examples of masked item reconstructions to demonstrate the model's grasp of user intent. Representation probing is challenging in generative models without additional architecture changes, but the ablations and improved next-item prediction support our hypothesis. We have updated the abstract and §4 to reflect these additions.
revision: yes
Referee: [§3] §3 Method: The entropy-guided masking policy is presented as intelligently targeting the most informative historical items, but without a formal definition or derivation showing how entropy is computed over the history (e.g., no equation for per-item entropy or masking probability), it is unclear whether this policy is parameter-free or introduces additional hyperparameters that could explain performance differences.
Authors: We appreciate this observation and have revised §3 to include a formal mathematical definition. Specifically, we now provide the equation for computing the entropy of each historical item based on the model's output distribution at that position, and derive the masking probability as a normalized function of these entropies. The core policy is parameter-free, as it uses the model's predictions directly without additional learned parameters, though the overall framework includes the curriculum scheduler which has a tunable transition hyperparameter selected on the validation set. This clarification ensures reproducibility and addresses potential concerns about hidden hyperparameters.
revision: yes
Referee: [§4] §4 Experiments: The outperformance is asserted without reported statistical significance tests, standard deviations across multiple runs, or comparisons against non-generative sequential baselines that also use masking or auxiliary objectives, making it difficult to attribute gains specifically to the proposed intent-comprehension mechanism versus generic benefits of multi-task training.
Authors: We have updated §4 to include standard deviations computed over five random seeds and statistical significance tests using paired t-tests with p-values reported for key comparisons. For the comparison to non-generative baselines, we maintain that the paper's contribution is within the generative recommendation paradigm, as non-generative models operate under different paradigms (e.g., embedding-based prediction rather than ID generation). However, we have added a paragraph discussing the potential overlap with multi-task learning benefits and why our specific auxiliary task is tailored to generative models. We believe this strengthens the attribution to the intent-comprehension mechanism.
revision: partial
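The reporting described in this response (mean and standard deviation over seeds, plus a paired t-test) can be sketched with the Python standard library alone. This is an illustrative helper, not the authors' evaluation code; the function names are made up here, and the sketch returns only the t statistic and degrees of freedom because the standard library has no t-distribution CDF for the p-value.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores a vs. b.

    a[i] and b[i] must come from the same seed (or data split) for pairing to hold.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With five seeds the test has four degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel(a, b)` computes the same statistic together with a two-sided p-value.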
Circularity Check
No significant circularity; method is an independent empirical proposal
The paper introduces Masked History Learning (MHL) as an augmentation to standard autoregressive training in generative recommendation, adding an auxiliary masked-history reconstruction task along with entropy-guided masking and a curriculum scheduler. No equations, derivations, or first-principles results appear in the provided text. The auxiliary objective is presented as a distinct addition rather than a redefinition or fit of the primary next-item prediction target. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or known results are renamed or smuggled in. The central claim rests on experimental comparisons to baselines on public datasets, which constitute independent empirical content rather than a reduction to the paper's own inputs by construction. This is a standard non-circular proposal of a new training framework.