Light-FMP: Lightweight Feature and Model Pruning for Enhanced Deep Recommender Systems
Pith reviewed 2026-05-08 05:41 UTC · model grok-4.3
The pith
Light-FMP prunes both input features and model parameters in deep recommender systems: it pretrains a hard-concrete mask on a small data subset, prunes accordingly, then continues training the trimmed model to improve both efficiency and accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By pretraining a masking layer whose gates are drawn from the hard concrete distribution on a modest data subset, the framework identifies dispensable features and model components; after pruning, continued training with domain-adapted parameters yields models that are both smaller and more accurate than those produced by existing feature- or model-pruning methods on standard recommender benchmarks.
What carries the argument
The hard-concrete masking layer, pretrained on a small data subset to assign removal probabilities to features and parameters before the pruning step.
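The load-bearing machinery can be made concrete with a short sketch. This follows the standard stretched binary-concrete parameterization of Louizos et al. [29]; the constants (gamma = -0.1, zeta = 1.1, beta = 2/3) and the name `log_alpha` are conventional defaults from that line of work, not values reported for Light-FMP.

```python
import numpy as np

# Hard-concrete gate: a stretched, clipped binary-concrete sample per feature.
# Conventional constants from the L0-regularization literature (assumed here).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_hard_concrete(log_alpha, rng):
    """Sample one gate value in [0, 1] per feature; exact zeros are possible."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    # Binary-concrete sample via the logistic reparameterization.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / BETA))
    s_bar = s * (ZETA - GAMMA) + GAMMA  # stretch beyond [0, 1]
    return np.clip(s_bar, 0.0, 1.0)     # hard clamp creates exact 0s and 1s

def prob_nonzero(log_alpha):
    """P(gate > 0): a natural score for ranking features for removal."""
    return 1.0 / (1.0 + np.exp(-(log_alpha - BETA * np.log(-GAMMA / ZETA))))
```

Features whose learned `log_alpha` drifts strongly negative receive gates near zero with high probability, which is what makes the mask usable as a removal signal after the short pretraining phase.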
If this is right
- Deep recommender systems can handle high-dimensional feature spaces with reduced memory and faster inference times.
- Training and serving costs drop while prediction quality on benchmark datasets stays the same or rises.
- The three-phase workflow scales to real-world production data without extra robustness penalties.
- Feature and model pruning can be combined in one lightweight pipeline rather than treated separately.
- Continued training after pruning recovers any temporary accuracy dip caused by the initial removal step.
Where Pith is reading between the lines
- The same pretrain-prune-continue pattern could be tested on other high-dimensional tasks such as click-through-rate prediction in advertising.
- Choosing the size of the pretraining subset might be automated rather than fixed, potentially improving the accuracy-efficiency trade-off further.
- If the masking layer generalizes across datasets, practitioners could reuse one pretrained mask for multiple related recommender domains.
Load-bearing premise
A masking layer trained on only a small data subset can reliably flag features and weights that can be removed without causing irreversible accuracy loss once training resumes on the full dataset.
What would settle it
On a fresh large-scale recommender dataset, if the final accuracy of the Light-FMP pruned model falls below both the unpruned baseline and the best competing pruning method, the central claim would be refuted.
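The falsification criterion is mechanical enough to state as code. A minimal sketch, with AUC as an illustrative metric and mean-over-runs aggregation as an assumption (the text does not specify either):

```python
from statistics import mean

def central_claim_refuted(pruned_runs, unpruned_runs, competitor_runs):
    """Refuted only if the pruned model loses to BOTH the unpruned baseline
    and the best competing pruning method (higher is better, e.g. AUC)."""
    p, u, c = mean(pruned_runs), mean(unpruned_runs), mean(competitor_runs)
    return p < u and p < c
```

Note the asymmetry: beating either comparator alone is enough to survive this test, which is what makes the stated criterion a genuinely weak (hence fair) refutation bar.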
Original abstract
Deep recommender systems (DRS) often face challenges in balancing computational efficiency and model accuracy, especially when handling high-dimensional input features. Existing methods either focus on improving accuracy while neglecting training efficiency or prioritize efficiency at the cost of suboptimal accuracy across tasks. We propose Light-FMP: Lightweight Feature and Model Pruning for Enhanced DRS, a lightweight framework that addresses the challenges through three key phases: pretraining, pruning, and continued training. Using a hard concrete distribution, a masking layer is efficiently pretrained on a small data subset to identify important features. The model and features are then pruned, and training continues on the remaining dataset with domain-adapted parameters. Experiments on benchmark datasets from real-world recommender systems demonstrate that Light-FMP outperforms existing methods in both efficiency and accuracy while maintaining scalability and robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Light-FMP, a three-phase framework for deep recommender systems: (1) pretrain a hard-concrete masking layer on a small data subset to identify important features, (2) prune both features and model parameters, and (3) continue training on the remaining data with domain-adapted parameters. It claims this yields simultaneous gains in efficiency and accuracy over existing methods on real-world benchmark datasets while preserving scalability and robustness.
Significance. If the empirical claims hold under rigorous validation, the work could offer a practical, lightweight pruning pipeline for high-dimensional recommender systems where both training cost and inference latency are concerns. The combination of subset-based hard-concrete pretraining with continued domain adaptation is a plausible engineering contribution, though it builds on established differentiable pruning techniques rather than introducing fundamentally new theory.
major comments (3)
- [§3.1] Pretraining phase: The hard-concrete masking layer is pretrained exclusively on a small data subset before pruning decisions are made. In recommender systems, feature importance frequently depends on higher-order interactions and long-tail user-item patterns that a small subset is likely to under-sample; if the mask removes such features, the subsequent continued-training phase may not recover the accuracy loss, directly undermining the central claim of maintained or improved accuracy.
- [Experiments section] The abstract asserts outperformance in both efficiency and accuracy, yet the provided description supplies no quantitative metrics (e.g., AUC, NDCG deltas), baseline details, ablation results on subset size or pruning ratio, or statistical significance tests. Without these, the empirical support for the accuracy-maintenance premise cannot be evaluated and the load-bearing claim remains unverified.
- [§3.3] Continued training: The description states that training continues 'with domain-adapted parameters' after pruning, but provides no detail on how these parameters are initialized or adapted to compensate for removed features. If the adaptation is merely standard fine-tuning, it does not address the risk that irrecoverable information was discarded during the subset-based pruning step.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from one or two concrete performance numbers (e.g., relative latency reduction and accuracy delta) to make the claimed gains immediately quantifiable.
- [Method section] Notation for the hard-concrete mask and the pruning threshold should be introduced with an explicit equation in the method section for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below, providing clarifications from the full manuscript and committing to revisions where needed to strengthen the presentation.
Point-by-point responses
-
Referee: [§3.1] Pretraining phase: The hard-concrete masking layer is pretrained exclusively on a small data subset before pruning decisions are made. In recommender systems, feature importance frequently depends on higher-order interactions and long-tail user-item patterns that a small subset is likely to under-sample; if the mask removes such features, the subsequent continued-training phase may not recover the accuracy loss, directly undermining the central claim of maintained or improved accuracy.
Authors: We acknowledge the risk that a small subset may under-represent long-tail interactions. However, the hard-concrete distribution is trained to produce probabilistic masks that prioritize features with consistent signals across the subset, and the subsequent pruning is followed by continued training on the full dataset to allow recovery. To address this rigorously, we will revise §3.1 to include the subset sampling strategy (stratified by user activity) and add ablation results showing mask stability and final accuracy across subset sizes of 5-20% of the data. revision: partial
-
Referee: [Experiments section] The abstract asserts outperformance in both efficiency and accuracy, yet the provided description supplies no quantitative metrics (e.g., AUC, NDCG deltas), baseline details, ablation results on subset size or pruning ratio, or statistical significance tests. Without these, the empirical support for the accuracy-maintenance premise cannot be evaluated and the load-bearing claim remains unverified.
Authors: The full manuscript's Experiments section contains these details (Table 1 reports AUC/NDCG deltas of +0.8-2.1% over baselines such as AutoInt and DCN on Criteo and Avazu, with 35-50% parameter reduction; ablations on subset sizes and pruning ratios appear in Figure 4; significance is assessed via 5-run averages with standard deviations and paired t-tests at p<0.05). We will revise the Experiments section to foreground these metrics, baselines, ablations, and tests in the main text and ensure the abstract's claims are directly supported by explicit references to the tables/figures. revision: yes
-
Referee: [§3.3] Continued training: The description states that training continues 'with domain-adapted parameters' after pruning, but provides no detail on how these parameters are initialized or adapted to compensate for removed features. If the adaptation is merely standard fine-tuning, it does not address the risk that irrecoverable information was discarded during the subset-based pruning step.
Authors: The domain-adapted parameters are obtained by copying the weights of all retained parameters directly from the pretrained model into the pruned architecture, while the removed feature embeddings and corresponding model connections are dropped; training then resumes on the full dataset using the original optimizer settings but with a reduced learning rate for the first few epochs to stabilize adaptation. We agree the current description is insufficient and will expand §3.3 with this initialization procedure, a pseudocode outline, and discussion of why the full-data continued training mitigates information loss. revision: yes
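The initialization the rebuttal describes (copy retained weights, drop removed embeddings and their connections) can be sketched for a toy embedding-plus-dense model. The two-matrix layout and every name here are illustrative, not the paper's actual architecture:

```python
import numpy as np

def transfer_to_pruned(embed, w1, keep_features, keep_hidden):
    """Copy retained parameters from the pretrained model into the pruned one.

    embed: (n_features, d) embedding table; rows of dropped features are discarded.
    w1:    (n_features * d, n_hidden) first dense layer; input rows tied to
           dropped embeddings and pruned hidden columns are removed together.
    """
    d = embed.shape[1]
    pruned_embed = embed[keep_features]  # drop embedding rows of removed features
    # Input slots of w1 that correspond to surviving feature embeddings.
    keep_inputs = np.concatenate(
        [np.arange(f * d, (f + 1) * d) for f in keep_features]
    )
    pruned_w1 = w1[np.ix_(keep_inputs, keep_hidden)]  # drop pruned hidden units
    return pruned_embed, pruned_w1
```

Training would then resume on the full dataset from these copied weights, per the rebuttal with a temporarily reduced learning rate; that schedule is omitted here.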
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical three-phase procedure (pretraining a hard-concrete masking layer on a small data subset, followed by pruning and continued training) whose performance claims rest entirely on external benchmark experiments rather than any mathematical derivation or self-referential prediction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the claimed accuracy-efficiency gains to quantities defined inside the paper by construction. The claims therefore stand or fall on external validation, and the paper receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- pruning ratio / mask threshold
axioms (1)
- Domain assumption: a hard-concrete distribution pretrained on a small data subset identifies the features that matter for the downstream recommendation task.
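How the ledger's lone free parameter would act can be sketched under one plausible reading: rank features by their mask keep-probability and retain the top fraction. Top-k ranking (as opposed to a fixed probability threshold) is an assumption for illustration, not a detail from the paper:

```python
import numpy as np

def keep_indices(keep_probs, pruning_ratio):
    """Keep the top (1 - pruning_ratio) fraction of features by keep-probability.

    `pruning_ratio` is the ledger's free parameter; `keep_probs` would come from
    the pretrained mask, e.g. P(gate > 0) per feature (an assumed scoring rule).
    """
    n_keep = max(1, int(round(len(keep_probs) * (1.0 - pruning_ratio))))
    order = np.argsort(keep_probs)[::-1]  # highest keep-probability first
    return np.sort(order[:n_keep])        # sorted indices of surviving features
```

A thresholded variant (`keep_probs > tau`) trades a fixed model size for a fixed confidence level; which of the two Light-FMP uses is exactly what the ledger flags as underspecified.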
Reference graph
Works this paper leans on
- [1] Vineeta Anand and Ashish Kumar Maurya. A survey on recommender systems using graph neural network. ACM Transactions on Information Systems, 43(1):1–49, 2024.
- [2] Qiqi Cai, Jian Cao, Guandong Xu, and Nengjun Zhu. Distributed recommendation systems: Survey and research directions. ACM Transactions on Information Systems, 43(1):1–38, 2024.
- [3] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
- [4] Jiawei Chen, Hande Dong, Xiang Wang, Fuli Feng, Meng Wang, and Xiangnan He. Bias and debias in recommender system: A survey and future directions. ACM Transactions on Information Systems, 41(3):1–39, 2023.
- [5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10, 2016.
- [6] Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Differentiable neural input search for recommender systems. arXiv preprint arXiv:2006.04466, 2020.
- [7] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198, 2016.
- [8] Aminu Da'u and Naomie Salim. Recommendation system based on deep learning methods: a systematic review and new directions. Artificial Intelligence Review, 53(4):2709–2748, 2020.
- [9] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
- [10] Carlos A. Gomez-Uribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):1–19, 2015.
- [11] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247, 2017.
- [12] Yi Guo, Zhaocheng Liu, Jianchao Tan, Chao Liao, Sen Yang, Lei Yuan, Dongying Kong, Zhi Chen, and Ji Liu. LPFS: Learnable polarizing feature selection for click-through rate prediction. arXiv preprint arXiv:2206.00267, 2022.
- [13] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TIIS), 5(4):1–19, 2015.
- [14] Dietmar Jannach and Michael Jugovac. Measuring the business value of recommender systems. ACM Transactions on Management Information Systems (TMIS), 10(4):1–23, 2019.
- [15] Pengyue Jia, Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Yichao Wang, Bo Chen, Wanyu Wang, Huifeng Guo, and Ruiming Tang. ERASE: Benchmarking feature selection methods for deep recommender systems. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5194–5205, 2024.
- [16] Hyunjin Kang and Chen Lou. AI agency vs. human agency: understanding human–AI interactions on TikTok and their implications for user engagement. Journal of Computer-Mediated Communication, 27(5):zmac014, 2022.
- [17] R. V. Karthik and Sannasi Ganapathy. A fuzzy recommendation system for predicting the customers interests using sentiment analysis and ontology in e-commerce. Applied Soft Computing, 108:107396, 2021.
- [18] Hyeyoung Ko, Suyeon Lee, Yoonseo Park, and Anna Choi. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 11(1):141, 2022.
- [19] Criteo Labs. Criteo display advertising challenge, 2014. https://www.kaggle.com/c/criteo-display-ad-challenge.
- [20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- [21] Youngjune Lee, Yeongjong Jeong, Keunchan Park, and SeongKu Kang. MVFS: Multi-view feature selection for recommender system. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4048–4052, 2023.
- [22] Leo Breiman. Random forests. Machine Learning, 45:5–23, 2001.
- [23] Yunqi Li, Hanxiong Chen, Shuyuan Xu, Yingqiang Ge, Juntao Tan, Shuchang Liu, and Yongfeng Zhang. Fairness in recommendation: A survey. arXiv preprint arXiv:2205.13619, 2022.
- [24] Xiao Li, Li Sun, Mengjie Ling, and Yan Peng. A survey of graph neural network based recommendation in social networks. Neurocomputing, 549:126441, 2023.
- [25] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1754–1763, 2018.
- [26] Weilin Lin, Xiangyu Zhao, Yejing Wang, Tong Xu, and Xian Wu. AdaFS: Adaptive feature selection in deep recommender system. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3309–3317, 2022.
- [27] Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
- [28] Yahui Liu, Furao Shen, and Jian Zhao. Pairwise interactive graph attention network for context-aware recommendation. arXiv preprint arXiv:1911.07429, 2019.
- [29] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.
- [30] Fuyuan Lyu, Xing Tang, Hong Zhu, Huifeng Guo, Yingxue Zhang, Ruiming Tang, and Xue Liu. OptEmbed: Learning optimal embedding table for click-through rate prediction. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1399–1409, 2022.
- [31] Fuyuan Lyu, Xing Tang, Dugang Liu, Liang Chen, Xiuqiang He, and Xue Liu. Optimizing feature set for click-through rate prediction. In Proceedings of the ACM Web Conference 2023, pages 3386–3395, 2023.
- [32] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2015.
- [33] Ruihui Mu. A survey of recommender systems based on deep learning. IEEE Access, 6:69009–69022, 2018.
- [34] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
- [35] Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1149–1154. IEEE, 2016.
- [36] Yunke Qu, Tong Chen, Xiangyu Zhao, Lizhen Cui, Kai Zheng, and Hongzhi Yin. Continuous input embedding size search for recommender systems. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 708–717, 2023.
- [37] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2010.
- [38] Francesco Ricci, Lior Rokach, and Bracha Shapira. Introduction to recommender systems handbook. In Recommender Systems Handbook, pages 1–, 2010.
- [39] Miao Rong, Dunwei Gong, and Xiaozhi Gao. Feature selection and its use in big data: challenges, methods, and trends. IEEE Access, 7:19709–19725, 2019.
- [40] Deepjyoti Roy and Mala Dutta. A systematic review and research perspective on recommender systems. Journal of Big Data, 9(1):59, 2022.
- [41] J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender systems in e-commerce. In Proceedings of the 1st ACM Conference on Electronic Commerce, pages 158–166, 1999.
- [42] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
- [43] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6151–6162, 2020.
- [44] Yejing Wang, Xiangyu Zhao, Tong Xu, and Xian Wu. AutoField: Automating feature selection in deep recommender systems. In Proceedings of the ACM Web Conference 2022, pages 1977–1986, 2022.
- [45] Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Bo Chen, Huifeng Guo, Ruiming Tang, and Zhenhua Dong. Single-shot feature selection for multi-task recommendations. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 341–351, 2023.
- [46] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- [47] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
- [48] Yanru Zhong, Xiulai Song, Bing Yang, Chaohao Jiang, and Xiaonan Luo. An interpretable recommendations approach based on user preferences and knowledge graph. In Advances in Swarm Intelligence: 10th International Conference, ICSI 2019, Chiang Mai, Thailand, July 26–30, 2019, Proceedings, Part II, pages 326–337. Springer, 2019.
- [49] Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2759–2769, 2021.