pith. machine review for the scientific record.

arxiv: 2512.21863 · v2 · submitted 2025-12-26 · 💻 cs.IR · cs.MM

Recognition: unknown

Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3

classification 💻 cs.IR cs.MM
keywords micro-video recommendation · frozen LVLMs · feature fusion · intermediate hidden states · ID embeddings · Dual Feature Fusion · multimodal features · video-language models

The pith

Intermediate hidden states from frozen LVLMs fused with item ID embeddings outperform both caption summaries and direct replacement in micro-video recommendation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first systematic comparison of ways to plug frozen large video-language models into micro-video recommenders. It tests two axes: whether to replace item IDs with LVLM outputs or fuse them, and whether to use generated captions or the model's own intermediate decoder states. Experiments show that captions discard visual details needed for accurate ranking, that ID embeddings carry unique collaborative signals that must be kept, and that fusion of selected decoder layers works best. From these patterns the authors derive a simple Dual Feature Fusion method that adaptively combines multi-layer LVLM representations with IDs. The resulting system reaches state-of-the-art accuracy on two public micro-video datasets while remaining lightweight and model-agnostic.

Core claim

Intermediate decoder hidden states from frozen LVLMs preserve fine-grained visual semantics that natural-language captions inevitably lose; because ID embeddings also supply irreplaceable collaborative signals, any strategy that replaces IDs with LVLM features underperforms; therefore a lightweight Dual Feature Fusion framework that adaptively merges selected multi-layer representations with ID embeddings delivers state-of-the-art results on two real-world micro-video recommendation benchmarks.

What carries the argument

Dual Feature Fusion (DFF) framework, a plug-and-play module that adaptively combines multi-layer decoder hidden states from a frozen LVLM with item ID embeddings.
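
To make the mechanism concrete, here is a minimal PyTorch sketch of a DFF-style module as described above. The sigmoid gate, layer count, and dimension names are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of a DFF-style fusion module (PyTorch). The sigmoid gate,
# layer count, and dimensions are illustrative assumptions, not the paper's
# exact parameterization.
import torch
import torch.nn as nn

class DualFeatureFusion(nn.Module):
    def __init__(self, n_layers: int, lvlm_dim: int, id_dim: int):
        super().__init__()
        # One learnable logit per selected decoder layer; softmax turns them
        # into adaptive mixing weights over the frozen LVLM's hidden states.
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(lvlm_dim, id_dim)    # LVLM space -> ID space
        self.gate = nn.Linear(2 * id_dim, id_dim)  # fuse rather than replace

    def forward(self, layer_feats: torch.Tensor, id_emb: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, n_layers, lvlm_dim) pooled hidden states per layer
        # id_emb:      (batch, id_dim) trainable collaborative ID embedding
        w = torch.softmax(self.layer_logits, dim=0)           # (n_layers,)
        mixed = (w[None, :, None] * layer_feats).sum(dim=1)   # (batch, lvlm_dim)
        content = self.proj(mixed)
        g = torch.sigmoid(self.gate(torch.cat([content, id_emb], dim=-1)))
        return g * content + (1 - g) * id_emb  # ID signal always survives
```

The gate keeps the ID embedding on the output path, which matches the paper's finding that fusion beats replacement: the collaborative signal in IDs is never overwritten, only reweighted.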

If this is right

  • Caption generation should be avoided when the goal is ranking rather than explanation, because it discards visual cues critical for preference prediction.
  • Item ID embeddings must be retained rather than overwritten, since they encode collaborative filtering signals that multimodal features alone do not capture.
  • Different decoder layers contribute unequally, so layer selection or weighting becomes an explicit design choice for any LVLM-based recommender.
  • The fusion approach is model-agnostic and adds negligible compute, allowing existing production pipelines to adopt it without retraining the LVLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intermediate-state principle may transfer to other multimodal ranking tasks such as image or short-video search where fine visual detail matters more than textual summary.
  • Production systems could add a small validation set to automatically pick the best decoder layers per model instead of fixing them in advance (see the sketch after this list).
  • If decoder-layer effectiveness proves dataset-dependent, future work might learn a lightweight router that selects or weights layers on the fly.
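
A minimal sketch of the validation-driven layer picking suggested above; `eval_layer_on_validation` is a hypothetical stand-in for training a small probe recommender on one layer's features and scoring it on held-out interactions (e.g. NDCG@10).

```python
# Hypothetical validation-driven layer picker: probe each frozen-LVLM decoder
# layer on a held-out split and keep the top-k.
def select_layers(n_layers, k, eval_layer_on_validation):
    scored = sorted(
        ((eval_layer_on_validation(layer), layer) for layer in range(n_layers)),
        reverse=True,
    )
    # Higher validation score -> more useful layer for ranking.
    return sorted(layer for _, layer in scored[:k])

# Toy illustration with made-up per-layer scores peaking mid-stack.
fake_scores = [0.30, 0.34, 0.41, 0.39, 0.33, 0.31]
print(select_layers(6, 2, lambda layer: fake_scores[layer]))  # [2, 3]
```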

Load-bearing premise

The two chosen real-world micro-video datasets are representative enough that the observed advantages of intermediate states and fusion will hold on other datasets and in production systems.

What would settle it

A third independent micro-video dataset on which either caption-based extraction or ID replacement matches or exceeds the accuracy of the proposed fusion method would falsify the central claim.
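
Any such head-to-head would be scored with standard offline ranking metrics; below is a minimal NDCG@K with binary relevance, the usual yardstick on these benchmarks (the example inputs are made up).

```python
# Minimal NDCG@K with binary relevance; inputs are made up for illustration.
import math

def ndcg_at_k(ranked_items, relevant, k=10):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_k(["v3", "v7", "v1"], relevant={"v1", "v3"}, k=3))  # ~0.92
```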

Figures

Figures reproduced from arXiv: 2512.21863 by Changguang Wu, Huatuan Sun, Pengfei Wang, Xiaoyu Du, Yanxin Zhang, Yunshan Ma.

Figure 1: Framework for integrating frozen Large Video Language Models (LVLMs) into micro-video recommendation.
Figure 2: Performance comparison of different feature extraction paradigms.
Figure 3: Performance of layer-wise features from Video…
Figure 4: Comparison of convergence speed, training efficiency, …
read the original abstract

Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without systematically comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
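
The "intermediate decoder hidden states" paradigm the abstract describes can be reproduced with the HuggingFace transformers API. The sketch below uses a small text-only causal LM as a stand-in, since video front-ends (frame sampling, visual tokenization) differ per LVLM and are not specified here.

```python
# Stand-in for hidden-state extraction via HuggingFace transformers; a real
# pipeline would load a frozen LVLM and feed video tokens, which this
# text-only placeholder ("gpt2") does not do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # frozen: no training

with torch.no_grad():
    inputs = tok("a cooking clip with fast cuts", return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states = (embeddings, layer_1, ..., layer_N), each of shape
# (batch, seq_len, hidden_dim). Mean-pool tokens to one vector per layer.
per_layer = torch.stack([h.mean(dim=1) for h in out.hidden_states[1:]], dim=1)
print(per_layer.shape)  # (1, n_layers, hidden_dim): input to a fusion module
```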

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first systematic empirical study on integrating frozen Large Video Language Models (LVLMs) into micro-video recommendation. It compares two integration strategies (replacement vs. fusion with item ID embeddings) and two feature extraction paradigms (LVLM-generated captions vs. intermediate decoder hidden states) across representative LVLMs. The study derives three principles: intermediate hidden states outperform captions, fusion is superior to replacement, and layer effectiveness varies. Guided by these, it proposes the lightweight Dual Feature Fusion (DFF) framework that adaptively fuses multi-layer LVLM representations with ID embeddings and reports state-of-the-art results on two real-world micro-video benchmarks.

Significance. If the empirical patterns hold under rigorous verification, the work offers practical, plug-and-play guidelines for incorporating off-the-shelf LVLMs into recommender systems, highlighting the value of preserving fine-grained visual semantics from hidden states rather than relying on captions. The DFF module is lightweight and could be adopted readily; the identification of layer-dependent effectiveness provides a concrete design principle for future multimodal recsys work.

major comments (2)
  1. [Abstract and Experiments section] The central SOTA claim and the three derived principles rest entirely on experiments with two real-world micro-video benchmarks (mentioned in the abstract). No cross-dataset validation, sensitivity analysis to dataset scale or content distribution, or additional benchmarks are referenced, which is load-bearing for the generalization of the superiority of intermediate states and DFF fusion.
  2. [Experiments section] The abstract and experimental reporting omit error bars, statistical significance tests, full ablation tables, and details on whether layer selection or fusion weights were tuned on the same test sets used for final evaluation. This prevents verification that the reported gains are robust rather than artifacts of post-hoc choices.
minor comments (2)
  1. [Method section] Clarify the exact LVLMs, layer indices, and fusion weight parameterization in the DFF description to improve reproducibility.
  2. [Results section] Add a table summarizing the three principles with quantitative deltas across models and datasets for easier reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript while defending the core contributions on the basis of the experiments performed.

read point-by-point responses
  1. Referee: [Abstract and Experiments section] The central SOTA claim and the three derived principles rest entirely on experiments with two real-world micro-video benchmarks (mentioned in the abstract). No cross-dataset validation, sensitivity analysis to dataset scale or content distribution, or additional benchmarks are referenced, which is load-bearing for the generalization of the superiority of intermediate states and DFF fusion.

    Authors: The two benchmarks used are standard, publicly available micro-video datasets that differ substantially in scale, user demographics, and content characteristics, providing a reasonable basis for the observed patterns. The three principles (intermediate states > captions, fusion > replacement, and layer-dependent effectiveness) hold consistently across both, which we view as supporting evidence for their broader applicability within the micro-video recommendation domain. We acknowledge that additional cross-dataset validation would further strengthen generalization claims. In the revision we will add an explicit discussion subsection on dataset representativeness, include sensitivity plots showing performance trends across data scales and sparsity levels within the existing benchmarks, and expand the limitations paragraph to note the absence of broader cross-domain testing. revision: partial

  2. Referee: [Experiments section] The abstract and experimental reporting omit error bars, statistical significance tests, full ablation tables, and details on whether layer selection or fusion weights were tuned on the same test sets used for final evaluation. This prevents verification that the reported gains are robust rather than artifacts of post-hoc choices.

    Authors: We agree that these reporting elements are important for reproducibility and robustness assessment. In the revised version we will (i) add error bars (standard deviation over multiple random seeds) to all main result tables, (ii) include statistical significance tests (paired t-tests with p-values) comparing DFF against the strongest baselines, (iii) provide complete ablation tables covering all design choices, and (iv) explicitly state that layer selection and fusion hyperparameters were tuned exclusively on the validation split, with final numbers reported on the held-out test set. These additions will be placed in the Experiments section and will not alter the original experimental protocol. revision: yes
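
The reporting the authors promise in (i) and (ii) amounts to a few lines of NumPy/SciPy; a sketch with made-up per-seed scores:

```python
# Sketch of the promised robustness reporting: mean ± std over seeds and a
# paired t-test against the strongest baseline. Scores below are made up.
import numpy as np
from scipy import stats

dff      = np.array([0.412, 0.418, 0.409, 0.415, 0.411])  # per-seed NDCG@10
baseline = np.array([0.398, 0.401, 0.395, 0.403, 0.397])

print(f"DFF:      {dff.mean():.4f} ± {dff.std(ddof=1):.4f}")
print(f"baseline: {baseline.mean():.4f} ± {baseline.std(ddof=1):.4f}")
t, p = stats.ttest_rel(dff, baseline)  # paired over identical seeds/splits
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```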

Circularity Check

0 steps flagged

No circularity: empirical study derives claims from measured benchmark results

full rationale

The paper conducts a systematic empirical comparison of LVLM feature extraction (captions vs. intermediate hidden states) and integration strategies (replacement vs. fusion) on two real-world micro-video recommendation benchmarks. The three principles and the DFF framework are presented as outcomes of these experiments, with final performance reported directly from the same evaluation protocol. No equations, derivations, or self-citations are invoked to force the central claims; results are obtained by direct measurement rather than by construction from fitted inputs or prior author theorems. The claims are therefore grounded in external benchmarks rather than in self-referential machinery.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on standard assumptions of recommender-system evaluation (i.i.d. train/test splits, offline metrics as proxies for user satisfaction) plus the domain assumption that the chosen LVLMs are representative of current large video-language models. No new physical constants or invented particles are introduced. The DFF module itself contains a small number of learnable fusion parameters whose exact count and initialization are not specified in the abstract.

free parameters (1)
  • layer-wise fusion weights
    Adaptive fusion coefficients that combine multiple decoder layers with ID embeddings; their values are learned during training of the lightweight DFF module.
axioms (1)
  • domain assumption: Offline ranking metrics on held-out user-item interactions are a valid proxy for real-world recommendation quality.
    Invoked implicitly when claiming state-of-the-art performance on the two benchmarks.
invented entities (1)
  • Dual Feature Fusion (DFF) module (no independent evidence)
    purpose: Lightweight plug-and-play component that adaptively fuses multi-layer LVLM representations with item ID embeddings.
    New architectural component introduced by the authors; no independent evidence outside the paper's own experiments is provided.

pith-pipeline@v0.9.0 · 5552 in / 1580 out tokens · 23725 ms · 2026-05-16T20:07:48.511123+00:00 · methodology

