Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion
Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3
The pith
Intermediate hidden states from frozen LVLMs fused with item ID embeddings outperform both caption summaries and direct replacement in micro-video recommendation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intermediate decoder hidden states from frozen LVLMs preserve fine-grained visual semantics that natural-language captions inevitably lose; because ID embeddings also supply irreplaceable collaborative signals, any strategy that replaces IDs with LVLM features underperforms; therefore a lightweight Dual Feature Fusion framework that adaptively merges selected multi-layer representations with ID embeddings delivers state-of-the-art results on two real-world micro-video recommendation benchmarks.
What carries the argument
Dual Feature Fusion (DFF) framework, a plug-and-play module that adaptively combines multi-layer decoder hidden states from a frozen LVLM with item ID embeddings.
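The paper does not publish DFF's exact parameterization. A minimal pure-Python sketch of one plausible form, softmax weights over selected decoder layers plus a convex gate against the ID embedding, might look like this (`layer_logits` and `gate` are hypothetical names for the learnable parameters, not the paper's):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dff_fuse(layer_states, id_emb, layer_logits, gate):
    """Fuse multi-layer LVLM hidden states with an item ID embedding.

    layer_states: list of L vectors (one per selected decoder layer)
    id_emb:       item ID embedding vector, same dimension
    layer_logits: L learnable scalars, softmaxed into layer weights
    gate:         scalar in (0, 1) balancing LVLM vs. ID features
    """
    w = softmax(layer_logits)
    dim = len(id_emb)
    # adaptive weighted sum over decoder layers
    lvlm = [sum(w[l] * layer_states[l][d] for l in range(len(layer_states)))
            for d in range(dim)]
    # convex gated combination with the collaborative ID embedding
    return [gate * lvlm[d] + (1 - gate) * id_emb[d] for d in range(dim)]

# two layers, dimension 2, uniform layer weights, balanced gate
fused = dff_fuse([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5],
                 layer_logits=[0.0, 0.0], gate=0.5)
```

In a real system the layer logits and gate would be trained jointly with the recommender while the LVLM stays frozen, which is what makes the module plug-and-play.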
If this is right
- Caption generation should be avoided when the goal is ranking rather than explanation, because it discards visual cues critical for preference prediction.
- Item ID embeddings must be retained rather than overwritten, since they encode collaborative filtering signals that multimodal features alone do not capture.
- Different decoder layers contribute unequally, so layer selection or weighting becomes an explicit design choice for any LVLM-based recommender.
- The fusion approach is model-agnostic and adds negligible compute, allowing existing production pipelines to adopt it without retraining the LVLM.
Where Pith is reading between the lines
- The same intermediate-state principle may transfer to other multimodal ranking tasks such as image or short-video search where fine visual detail matters more than textual summary.
- Production systems could add a small validation set to automatically pick the best decoder layers per model instead of fixing them in advance.
- If decoder-layer effectiveness proves dataset-dependent, future work might learn a lightweight router that selects or weights layers on the fly.
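The validation-driven layer selection suggested above reduces to a small search. A sketch, where the evaluation callable (e.g. Recall@10 of a recommender fed that layer's features) is an assumption, not an interface from the paper:

```python
def select_layers(layer_ids, eval_on_validation, top_k=2):
    """Pick the top-k decoder layers by a validation ranking metric.

    layer_ids:          candidate decoder layer indices
    eval_on_validation: callable layer_id -> metric (higher is better)
    """
    scored = sorted(layer_ids, key=eval_on_validation, reverse=True)
    return scored[:top_k]

# illustrative scores: layer 1 and layer 2 validate best
best = select_layers([0, 1, 2, 3],
                     lambda l: {0: 0.10, 1: 0.30, 2: 0.20, 3: 0.05}[l],
                     top_k=2)
```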
Load-bearing premise
The two chosen real-world micro-video datasets are representative enough that the observed advantages of intermediate states and fusion will hold on other datasets and in production systems.
What would settle it
A third independent micro-video dataset on which either caption-based extraction or ID replacement matches or exceeds the accuracy of the proposed fusion method would falsify the central claim.
Original abstract
Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without systematically comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first systematic empirical study on integrating frozen Large Video Language Models (LVLMs) into micro-video recommendation. It compares two integration strategies (replacement vs. fusion with item ID embeddings) and two feature extraction paradigms (LVLM-generated captions vs. intermediate decoder hidden states) across representative LVLMs. The study derives three principles: intermediate hidden states outperform captions, fusion is superior to replacement, and layer effectiveness varies. Guided by these, it proposes the lightweight Dual Feature Fusion (DFF) framework that adaptively fuses multi-layer LVLM representations with ID embeddings and reports state-of-the-art results on two real-world micro-video benchmarks.
Significance. If the empirical patterns hold under rigorous verification, the work offers practical, plug-and-play guidelines for incorporating off-the-shelf LVLMs into recommender systems, highlighting the value of preserving fine-grained visual semantics from hidden states rather than relying on captions. The DFF module is lightweight and could be adopted readily; the identification of layer-dependent effectiveness provides a concrete design principle for future multimodal recsys work.
major comments (2)
- [Abstract and Experiments section] The central SOTA claim and the three derived principles rest entirely on experiments with two real-world micro-video benchmarks (mentioned in the abstract). No cross-dataset validation, sensitivity analysis to dataset scale or content distribution, or additional benchmarks are referenced, which is load-bearing for the generalization of the superiority of intermediate states and DFF fusion.
- [Experiments section] The abstract and experimental reporting omit error bars, statistical significance tests, full ablation tables, and details on whether layer selection or fusion weights were tuned on the same test sets used for final evaluation. This prevents verification that the reported gains are robust rather than artifacts of post-hoc choices.
minor comments (2)
- [Method section] Clarify the exact LVLMs, layer indices, and fusion weight parameterization in the DFF description to improve reproducibility.
- [Results section] Add a table summarizing the three principles with quantitative deltas across models and datasets for easier reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point-by-point below, indicating where revisions will be made to strengthen the manuscript while defending the core contributions on the basis of the experiments performed.
point-by-point responses
- Referee: [Abstract and Experiments section] The central SOTA claim and the three derived principles rest entirely on experiments with two real-world micro-video benchmarks (mentioned in the abstract). No cross-dataset validation, sensitivity analysis to dataset scale or content distribution, or additional benchmarks are referenced, which is load-bearing for the generalization of the superiority of intermediate states and DFF fusion.
  Authors: The two benchmarks used are standard, publicly available micro-video datasets that differ substantially in scale, user demographics, and content characteristics, providing a reasonable basis for the observed patterns. The three principles (intermediate states > captions, fusion > replacement, and layer-dependent effectiveness) hold consistently across both, which we view as supporting evidence for their broader applicability within the micro-video recommendation domain. We acknowledge that additional cross-dataset validation would further strengthen generalization claims. In the revision we will add an explicit discussion subsection on dataset representativeness, include sensitivity plots showing performance trends across data scales and sparsity levels within the existing benchmarks, and expand the limitations paragraph to note the absence of broader cross-domain testing. revision: partial
- Referee: [Experiments section] The abstract and experimental reporting omit error bars, statistical significance tests, full ablation tables, and details on whether layer selection or fusion weights were tuned on the same test sets used for final evaluation. This prevents verification that the reported gains are robust rather than artifacts of post-hoc choices.
  Authors: We agree that these reporting elements are important for reproducibility and robustness assessment. In the revised version we will (i) add error bars (standard deviation over multiple random seeds) to all main result tables, (ii) include statistical significance tests (paired t-tests with p-values) comparing DFF against the strongest baselines, (iii) provide complete ablation tables covering all design choices, and (iv) explicitly state that layer selection and fusion hyperparameters were tuned exclusively on the validation split, with final numbers reported on the held-out test set. These additions will be placed in the Experiments section and will not alter the original experimental protocol. revision: yes
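The promised paired significance test needs nothing beyond the standard library. A sketch with illustrative per-seed metric values (not numbers from the paper); the resulting t-statistic would be compared against a t-distribution table with n-1 degrees of freedom for a p-value:

```python
import math
import statistics

def paired_t_statistic(model_a, model_b):
    """Paired t-statistic over per-seed metric values for two models.

    model_a, model_b: the same metric (e.g. NDCG@10) measured under
    identical random seeds, so differences pair up seed by seed.
    """
    diffs = [a - b for a, b in zip(model_a, model_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (ddof=1)
    return mean / (sd / math.sqrt(n))

# illustrative: three seeds each for a proposed model and a baseline
t = paired_t_statistic([0.30, 0.32, 0.31], [0.28, 0.29, 0.30])
```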
Circularity Check
No circularity: empirical study derives claims from measured benchmark results
full rationale
The paper conducts a systematic empirical comparison of LVLM feature extraction (captions vs. intermediate hidden states) and integration strategies (replacement vs. fusion) on two real-world micro-video recommendation benchmarks. The three principles and the DFF framework are presented as outcomes of these experiments, with final performance reported directly from the same evaluation protocol. No equations, derivations, or self-citations are invoked to force the central claims; results are obtained by direct measurement rather than by construction from fitted inputs or prior author theorems. The claims are therefore grounded in external benchmarks rather than in self-referential constructions.
Axiom & Free-Parameter Ledger
free parameters (1)
- layer-wise fusion weights
axioms (1)
- Domain assumption: offline ranking metrics on held-out user-item interactions are a valid proxy for real-world recommendation quality.
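That axiom usually cashes out as NDCG@k over held-out interactions. A minimal sketch of the metric, illustrative rather than the paper's exact protocol:

```python
import math

def ndcg_at_k(ranked_relevance, k):
    """NDCG@k for one user.

    ranked_relevance[i] is the relevance of the item at rank i+1,
    e.g. 1 if the user interacted with it in the held-out set, else 0.
    """
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2)
               for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```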
invented entities (1)
- Dual Feature Fusion (DFF) module (no independent evidence)