
arxiv: 2604.20847 · v1 · submitted 2026-02-10 · 💻 cs.IR · cs.AI

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

Pith reviewed 2026-05-16 02:35 UTC · model grok-4.3

keywords music recommendation · content-based filtering · self-supervised audio encoders · feature aggregation · multimodal recommendation · cold-start problem · MuQ-token

The pith

Features from large-scale self-supervised music encoders, aggregated via the MuQ-token method, improve recall and CTR in music recommendation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that audio representations extracted from recent large-scale self-supervised music encoders add measurable value to both candidate recall and click-through-rate prediction in music recommendation systems. Existing collaborative filtering approaches ignore intrinsic audio properties and perform poorly when user history is limited, yet prior datasets rarely supply raw audio or text metadata for testing richer alternatives. To close this gap the authors release the TASTE dataset and benchmark, which pairs tracks with both audio waveforms and textual descriptions. Their central technical contribution is the MuQ-token aggregation technique that folds multi-layer encoder outputs into recommendation models more efficiently than standard pooling or concatenation methods, delivering consistent gains across recall and ranking tasks.

Core claim

Audio representations obtained from large-scale self-supervised music encoders substantially improve candidate recall and CTR prediction when integrated into recommendation models. The MuQ-token method, which reformulates multi-layer features as tokens for efficient processing, outperforms alternative aggregation strategies on the TASTE dataset across multiple experimental settings. These results establish the practical utility of content-driven signals and supply a multimodal dataset and evaluation framework that supports future work beyond pure collaborative filtering.

What carries the argument

The MuQ-token method, which converts multi-layer outputs from a music encoder into a token sequence for efficient aggregation into downstream recommendation models.
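The general idea of treating per-layer encoder outputs as tokens can be sketched roughly as follows. This is not the paper's implementation: the function names, the single-query softmax attention, and the toy 4-dimensional vectors are all assumptions made for illustration, and the mean-pooling and concatenation functions stand in for the "standard" baselines mentioned in the summary.

```python
import math

def mean_pool(layer_tokens):
    """Baseline: average the per-layer vectors dimension by dimension."""
    n = len(layer_tokens)
    dim = len(layer_tokens[0])
    return [sum(tok[d] for tok in layer_tokens) / n for d in range(dim)]

def concatenate(layer_tokens):
    """Baseline: stack all layers into one long vector (size grows with depth)."""
    return [x for tok in layer_tokens for x in tok]

def attention_pool(layer_tokens, query):
    """Hypothetical token-style aggregation: one query attends over layer tokens."""
    dim = len(query)
    # Scaled dot-product score between the query and each layer token.
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(dim)
              for tok in layer_tokens]
    # Numerically stable softmax over the layer axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum keeps the output at the size of a single layer.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, layer_tokens))
              for d in range(dim)]
    return pooled, weights

# Toy input: three "layers", each already pooled over time to a 4-dim vector.
layer_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
query = [0.0, 2.0, 0.0, 0.0]  # assumed learned in practice; fixed here
pooled, weights = attention_pool(layer_tokens, query)
```

In this sketch the attention output stays at the width of one layer while concatenation grows linearly with the number of layers, which is one plausible reason a token-based formulation could be cheaper downstream.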

If this is right

  • Content audio features can mitigate cold-start problems by supplying track-intrinsic signals when user history is absent.
  • Efficient multi-layer aggregation makes deeper encoder outputs feasible in production recommendation pipelines without prohibitive compute cost.
  • The TASTE benchmark enables direct comparison of multimodal methods against collaborative baselines on identical data splits.
  • Hybrid systems that combine collaborative and content signals become easier to evaluate and deploy once reusable audio features are available.
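To make the cold-start point concrete, here is a minimal sketch of content-only scoring. It assumes the item-side case, where a new track has an audio embedding but no interaction history; the embeddings, track IDs, and the mean-of-history user profile are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def user_profile(listened_embeddings):
    """Assumed profile: the mean audio embedding of tracks the user played."""
    n = len(listened_embeddings)
    dim = len(listened_embeddings[0])
    return [sum(e[d] for e in listened_embeddings) / n for d in range(dim)]

def rank_new_tracks(profile, new_tracks):
    """Rank tracks with no interactions purely by audio similarity."""
    scored = [(track_id, cosine(profile, emb)) for track_id, emb in new_tracks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: embeddings of previously played tracks and two unseen tracks.
history = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
new_tracks = {"new_a": [1.0, 0.0, 0.0], "new_b": [0.0, 0.0, 1.0]}
ranking = rank_new_tracks(user_profile(history), new_tracks)
```

A collaborative-filtering model has nothing to say about `new_a` or `new_b` until someone plays them, whereas the content score is available as soon as the audio is encoded.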

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the observed gains persist on live traffic, platforms may reduce reliance on user-history data and serve recommendations earlier in a user's lifecycle.
  • The token-based aggregation pattern could transfer to other media types such as video or podcast recommendation where layered encoder outputs are already common.
  • Plugging newer or larger music encoders into the same MuQ-token pipeline would provide a direct test of whether further scaling yields additional lifts.

Load-bearing premise

The audio features and performance patterns observed on the TASTE dataset will hold for real user behavior on live music streaming platforms.

What would settle it

An online A/B test on a commercial music service in which the MuQ-token audio features produce no increase or a decrease in recall or CTR relative to a strong collaborative-filtering baseline.

Figures

Figures reproduced from arXiv: 2604.20847 by Da-Wei Zhou, De-Chuan Zhan, Jia-Qi Yang, Yizhi Zhou.

Figure 1: The overall framework of our proposed music recommendation system. The workflow establishes a comprehensive …
Figure 2: Similarity of Clustering Results Across Different …
Figure 3: Model Performance Comparison Across Different …
Figure 5: AUC scores under different numbers of clusters.
Figure 6: User features distribution across time.
Figure 7: Item features distribution across time.
original abstract

Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at https://github.com/zreach/TASTE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TASTE dataset and benchmarking framework for music recommendation, integrating audio and textual modalities. It extracts representations from large-scale self-supervised music encoders and proposes the MuQ-token method for efficient multi-layer feature aggregation, claiming consistent outperformance over other integration techniques on candidate recall and CTR tasks.

Significance. If the empirical results hold under rigorous validation, the work supplies a reusable multimodal dataset and framework that quantifies the value of content-based audio features in music recommendation, especially for cold-start settings, and offers a practical aggregation technique that could be adopted in production systems.

major comments (2)
  1. [Experimental Evaluation] Experimental section: the abstract asserts consistent outperformance of MuQ-token across settings, yet no details on train/test splits, statistical significance tests, error bars, or hyperparameter sensitivity are visible; without these the headline claim cannot be verified and the generalization concern raised by the skeptic remains open.
  2. [Results] §4 (or equivalent results section): internal comparisons to other aggregation methods on TASTE alone do not address external validity; if the multi-layer features primarily exploit dataset-specific correlations rather than transferable audio properties, the reported gains may not replicate on other corpora or live logs.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'consistently outperforms' would be strengthened by reporting the magnitude of gains (e.g., relative recall@10 lift) rather than qualitative language.
  2. [Method] Notation: clarify whether MuQ-token operates on frozen encoder layers or allows fine-tuning, as this affects reproducibility claims.
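The "relative recall@10 lift" suggested in the first minor comment could be computed along these lines. This is a sketch with toy rankings and k=5 for brevity; the item IDs and numbers are invented for illustration and are not results from the paper.

```python
def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

def relative_lift(new_value, baseline_value):
    """Relative improvement of new_value over baseline_value (0.5 means +50%)."""
    return (new_value - baseline_value) / baseline_value

# Toy example: four held-out relevant tracks and two candidate rankings.
relevant = {"t1", "t2", "t3", "t4"}
baseline_ranking = ["t1", "x1", "x2", "t2", "x3"]
candidate_ranking = ["t1", "t2", "x1", "t3", "x2"]

base = recall_at_k(baseline_ranking, relevant, k=5)
cand = recall_at_k(candidate_ranking, relevant, k=5)
lift = relative_lift(cand, base)
```

Reporting the lift alongside the absolute values matters because a large relative gain over a very low baseline can still be a small absolute change.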

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions to strengthen the experimental reporting and discussion of generalizability.

point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental section: the abstract asserts consistent outperformance of MuQ-token across settings, yet no details on train/test splits, statistical significance tests, error bars, or hyperparameter sensitivity are visible; without these the headline claim cannot be verified and the generalization concern raised by the skeptic remains open.

    Authors: We agree that these experimental details are essential for verifying the claims. In the revised manuscript, we will expand the experimental section to explicitly describe the train/test splits, include statistical significance tests (such as paired t-tests with p-values), report error bars as standard deviations over multiple random seeds, and add a hyperparameter sensitivity analysis for MuQ-token and competing methods. These additions will directly support the outperformance assertions.
    revision: yes

  2. Referee: [Results] §4 (or equivalent results section): internal comparisons to other aggregation methods on TASTE alone do not address external validity; if the multi-layer features primarily exploit dataset-specific correlations rather than transferable audio properties, the reported gains may not replicate on other corpora or live logs.

    Authors: We acknowledge the valid concern regarding external validity. While TASTE is introduced as a reusable benchmark, the current results are indeed internal to this dataset. In the revision, we will add an explicit limitations subsection discussing the risk of dataset-specific correlations and the transferability of audio representations from large-scale models. We will also release the full code, preprocessed features, and evaluation scripts to enable straightforward replication on other corpora, and note replication on additional public datasets as important future work.
    revision: partial
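The significance reporting promised in the first response (error bars over seeds and a paired t-test) could look roughly like the sketch below. The per-seed scores are synthetic placeholders, not numbers from the paper, and the function names are hypothetical. Only the t-statistic and degrees of freedom are computed, since the Python standard library has no t-distribution CDF; a p-value would come from a statistics package or a t-table.

```python
import math
import statistics

def mean_and_std(values):
    """Mean and sample standard deviation, e.g. for 'mean ± std' error bars."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for two methods evaluated on the same seeds."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_diff = statistics.stdev(diffs)
    if std_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = statistics.mean(diffs) / (std_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1  # (t-statistic, degrees of freedom)

# Synthetic per-seed AUC values for illustration only.
toy_method_auc = [0.712, 0.715, 0.709, 0.718, 0.711]
toy_baseline_auc = [0.701, 0.706, 0.700, 0.705, 0.703]

method_mean, method_std = mean_and_std(toy_method_auc)
t_stat, dof = paired_t_statistic(toy_method_auc, toy_baseline_auc)
```

Pairing by seed is the relevant design choice here: it removes seed-to-seed variance that both methods share, so a small but consistent gap can still be detected.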

Circularity Check

0 steps flagged

No circularity: purely empirical claims on new dataset and aggregation method

full rationale

The paper introduces the TASTE dataset and MuQ-token aggregation technique, then reports experimental comparisons showing outperformance on recall and CTR tasks. No derivation chain, equations, or self-citations are used to derive results from inputs by construction. All load-bearing claims rest on direct empirical evaluation against baselines, which is independent of any fitted parameter renaming or self-referential definition. The reader's assessment of score 1.0 is consistent with the absence of any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and relies on pre-existing large-scale music encoders and standard recommendation evaluation practices. No new free parameters, axioms, or invented entities are introduced in the abstract.


