
arxiv: 1907.02203 · v1 · pith:NAW5SO4Hnew · submitted 2019-07-04 · 💻 cs.IR · cs.LG

An Item Recommendation Approach by Fusing Images based on Neural Networks

Pith reviewed 2026-05-25 09:34 UTC · model grok-4.3

classification 💻 cs.IR cs.LG
keywords item recommendation · neural collaborative filtering · image fusion · convolutional neural network · matrix factorization · multi-layer perceptron · visual features · RMSE

The pith

Incorporating visual features from images into a neural recommendation model improves prediction accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce MF-VMLP, a model that extracts visual features from item images using a pre-trained convolutional neural network and fuses them with user and item latent factors using a multi-layer perceptron. This fusion is combined with matrix factorization to make preference predictions. The approach aims to account for how an item's appearance influences user choices beyond ratings alone. Experiments on an Amazon dataset show that this method reduces root-mean-square error compared to models without visual information. If correct, it means recommendation systems can leverage image data to make more accurate suggestions for items where looks matter.

Core claim

The paper presents MF-VMLP, which obtains visual representations via a pre-trained CNN, uses an MLP to learn nonlinear interactions between latent vectors and visual vectors, and combines MF and MLP for collaborative filtering. Experiments on Amazon's public dataset using RMSE demonstrate that the model boosts recommendation performance.

What carries the argument

MF-VMLP model that fuses CNN-extracted visual features with matrix factorization and multi-layer perceptron for nonlinear combination.
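
As a rough illustration of this fusion, the following Python sketch runs one forward pass through an MF-VMLP-style network. It is not the authors' implementation: the vector sizes, the random weights, the separate embeddings for the two branches (borrowed from NCF's NeuMF design), and the names `init_params` and `mf_vmlp_forward` are all assumptions made for the example, and a short random vector stands in for the pre-trained CNN feature.

```python
import random


def relu_layer(x, weights, bias):
    """Dense layer with ReLU activation: relu(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for parameters that would be learned."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def rand_vector(n, rng):
    """Random stand-in for an embedding or a CNN feature vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def init_params(k, visual_dim, visual_k, hidden_sizes, rng):
    """Build illustrative parameters; every size is an assumption."""
    params = {
        "W_vis": rand_matrix(visual_k, visual_dim, rng),
        "b_vis": [0.0] * visual_k,
        "hidden": [],
    }
    in_dim = k + k + visual_k  # user vector + item vector + reduced visual vector
    for out_dim in hidden_sizes:
        params["hidden"].append((rand_matrix(out_dim, in_dim, rng), [0.0] * out_dim))
        in_dim = out_dim
    # Output weights act on [MF branch output, last MLP hidden layer].
    params["w_out"] = [rng.uniform(-0.1, 0.1) for _ in range(k + in_dim)]
    params["b_out"] = 0.0
    return params


def mf_vmlp_forward(p_mf, q_mf, p_mlp, q_mlp, visual, params):
    """Return one predicted score for a (user, item, image feature) triple."""
    # MF branch: element-wise product of user and item latent vectors.
    mf_out = [pu * qi for pu, qi in zip(p_mf, q_mf)]
    # Reduce the high-dimensional visual feature to a small vector.
    v = relu_layer(visual, params["W_vis"], params["b_vis"])
    # MLP branch: hidden layers over [user vector, item vector, visual vector].
    h = p_mlp + q_mlp + v
    for weights, bias in params["hidden"]:
        h = relu_layer(h, weights, bias)
    # Fusion: concatenate both branch outputs and map them to a scalar.
    fused = mf_out + h
    return sum(w * f for w, f in zip(params["w_out"], fused)) + params["b_out"]


rng = random.Random(0)
K, VISUAL_DIM = 8, 32  # toy sizes; real CNN features are typically much larger
params = init_params(K, VISUAL_DIM, visual_k=8, hidden_sizes=[16, 8], rng=rng)
prediction = mf_vmlp_forward(
    rand_vector(K, rng), rand_vector(K, rng),   # MF-branch user / item vectors
    rand_vector(K, rng), rand_vector(K, rng),   # MLP-branch user / item vectors
    rand_vector(VISUAL_DIM, rng),               # stand-in for a CNN image feature
    params,
)
print(f"predicted score: {prediction:.4f}")
```

In a trained model these weights would be learned from observed ratings. The loss function and optimizer are not stated in this summary, so no training loop is shown.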

If this is right

  • Visual characteristics of items can be used to predict user preferences.
  • The combination of MF and MLP achieves collaborative filtering that incorporates images.
  • The model shows improved performance on real-world data as measured by lower RMSE.
  • Item images provide additional information not captured by ratings or text alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • For product categories like fashion or home goods, visual data may be particularly valuable for recommendations.
  • The method could be extended by training the CNN on domain-specific images rather than using a pre-trained general model.
  • Similar fusion techniques might apply to other data types such as video or audio for items.

Load-bearing premise

The visual features from the pre-trained CNN capture meaningful item characteristics that influence user preferences in a way that can be combined with latent factors.

What would settle it

The claim would be refuted if adding the visual features to the model on the Amazon dataset did not yield a lower RMSE than the same model without images.
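
The comparison this test turns on is a plain RMSE calculation. A minimal Python sketch follows; the function name `rmse` and the rating values are made up for illustration and are not results from the paper.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical held-out ratings and predictions (not data from the paper).
actual = [5.0, 3.0, 4.0, 1.0]
pred_with_images = [4.5, 3.5, 4.0, 1.5]
pred_without_images = [4.0, 4.0, 3.0, 2.0]

rmse_with = rmse(pred_with_images, actual)
rmse_without = rmse(pred_without_images, actual)
# The claim is supported only if the model with images has the lower error.
print(f"with images: {rmse_with:.3f}, without images: {rmse_without:.3f}")
print("visual features help" if rmse_with < rmse_without else "no improvement")
```

For the comparison to be meaningful, both sets of predictions should come from the same held-out ratings and differ only in whether the image features are used.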

Figures

Figures reproduced from arXiv: 1907.02203 by Lin Li, Weibin Lin.

Figure 1: VMLP model.
Figure 2: MF-VMLP model.
Figure 3: Experimental comparison.
Original abstract

There are rich formats of information in the network, such as rating, text, image, and so on, which represent different aspects of user preferences. In the field of recommendation, how to use those data effectively has become a difficult subject. With the rapid development of neural network, researching on multi-modal method for recommendation has become one of the major directions. In the existing recommender systems, numerical rating, item description and review are main information to be considered by researchers. However, the characteristics of the item may affect the user's preferences, which are rarely used for recommendation models. In this work, we propose a novel model to incorporate visual factors into predictors of people's preferences, namely MF-VMLP, based on the recent developments of neural collaborative filtering (NCF). Firstly, we get visual presentation via a pre-trained convolutional neural network (CNN) model. To obtain the nonlinearities interaction of latent vectors and visual vectors, we propose to leverage a multi-layer perceptron (MLP) to learn. Moreover, the combination of MF and MLP has achieved collaborative filtering recommendation between users and items. Our experiments conduct Amazon's public dataset for experimental validation and root-mean-square error (RMSE) as evaluation metrics. To some extent, experimental result on a real-world data set demonstrates that our model can boost the recommendation performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MF-VMLP, a neural collaborative filtering model extending NCF by extracting visual features from items via a pre-trained CNN and fusing them with user/item latent factors through an MLP to capture nonlinear interactions; it reports RMSE results on an Amazon public dataset and claims that incorporating these visual factors boosts recommendation performance.

Significance. If the performance gains are shown to hold under standard controls, the work would contribute evidence that pre-trained visual embeddings can be usefully combined with latent factors in multimodal recommendation, extending the NCF framework to image data.

major comments (3)
  1. [§4] §4 (Experiments): The reported RMSE improvement is presented without any baseline comparisons (e.g., standard MF, NCF, or other multimodal models), dataset statistics, train/test split details, or negative sampling protocol, making it impossible to verify whether the claimed boost is attributable to visual fusion rather than architecture or evaluation choices.
  2. [§4] §4 (Experiments): No ablation isolating the visual component (e.g., MF-VMLP vs. MF+MLP without images) is provided, so the central claim that visual features from the CNN meaningfully affect user preferences cannot be assessed.
  3. [§3] §3 (Model): The exact dimensions of the visual vectors, MLP layer widths/depths, loss function, and optimization procedure for fusing visual and latent vectors are not specified, leaving the implementation of the claimed nonlinear interaction underspecified.
minor comments (2)
  1. [Abstract] Abstract contains grammatical issues (e.g., 'get visual presentation via' should read 'obtain visual representations using'; 'conduct Amazon's public dataset' should read 'conduct experiments on Amazon's public dataset').
  2. [§3] Notation for latent factors and visual vectors is introduced without consistent symbols or a clear diagram of the fusion architecture.
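
Major comment 1 asks for the train/test split to be documented. One way such a protocol could be made reproducible is sketched below in Python; the function `split_ratings`, the 80/20 ratio, the seed, and the toy rating triples are assumptions for illustration, not the paper's setup.

```python
import random


def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Shuffle (user, item, rating) triples with a fixed seed and split them."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy triples: (user id, item id, rating). Not taken from the Amazon dataset.
ratings = [(u, i, float((u + i) % 5 + 1)) for u in range(5) for i in range(4)]
train, test = split_ratings(ratings)
print(f"{len(ratings)} ratings -> {len(train)} train / {len(test)} test")
```

Fixing the seed and reporting it alongside the split ratio lets another group regenerate the same held-out set.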

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the experimental section and model description require additional details for reproducibility and to substantiate the claims. We will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported RMSE improvement is presented without any baseline comparisons (e.g., standard MF, NCF, or other multimodal models), dataset statistics, train/test split details, or negative sampling protocol, making it impossible to verify whether the claimed boost is attributable to visual fusion rather than architecture or evaluation choices.

    Authors: We acknowledge that the current experimental reporting lacks these essential details. In the revised manuscript we will add baseline comparisons against standard MF, the original NCF, and at least one other multimodal model; report dataset statistics (users, items, ratings); specify the train/test split procedure; and describe the negative sampling protocol. These additions will allow readers to assess whether gains are due to visual fusion.
    Revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation isolating the visual component (e.g., MF-VMLP vs. MF+MLP without images) is provided, so the central claim that visual features from the CNN meaningfully affect user preferences cannot be assessed.

    Authors: We agree an ablation is required to isolate the visual component. We will add results comparing the full MF-VMLP model against an MF+MLP variant that omits the CNN-derived visual vectors, thereby demonstrating the contribution of the visual features.
    Revision: yes

  3. Referee: [§3] §3 (Model): The exact dimensions of the visual vectors, MLP layer widths/depths, loss function, and optimization procedure for fusing visual and latent vectors are not specified, leaving the implementation of the claimed nonlinear interaction underspecified.

    Authors: We will expand Section 3 to specify the visual vector dimension produced by the pre-trained CNN, the exact widths and depths of the MLP layers, the loss function employed, and the optimization procedure used to fuse the visual and latent vectors.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical model proposal with independent validation

full rationale

The paper proposes MF-VMLP by extracting visual features via pre-trained CNN then combining latent factors with MLP on top of matrix factorization. The performance claim rests on RMSE measured on Amazon dataset experiments. No equations, derivations, or predictions are presented that reduce to fitted inputs by construction. No self-citations, uniqueness theorems, or ansatzes appear in the provided text. The result is an empirical demonstration on external real-world data and is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that visual item features are predictive of preference and on standard neural-network training assumptions; no new entities are postulated.

free parameters (2)
  • latent factor dimension
    Embedding size for users and items, chosen during model design
  • MLP layer widths and depths
    Architecture parameters selected to model interactions between latent and visual vectors
axioms (1)
  • domain assumption: Visual characteristics of items influence user preferences independently of numerical ratings
    Invoked in the abstract as the reason for adding image data


