Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

Jing Ma; Jungong Han; Pinghua Gong; Shijie Yang; Tianlu Zhang; Weijie Ding; Yanlong Zang; Yufei Guo

arxiv: 2605.17366 · v1 · pith:Z4X4O7C5new · submitted 2026-05-17 · 💻 cs.IR

Yufei Guo, Jing Ma, Tianlu Zhang, Shijie Yang, Yanlong Zang, Weijie Ding, Pinghua Gong, Jungong Han

Pith reviewed 2026-05-19 23:04 UTC · model grok-4.3

classification 💻 cs.IR

keywords multimodal recommendation · visual representation learning · e-commerce retrieval · text-guided learning · Q-Former · item-to-item recommendation · robust embeddings · noisy images

The pith

TGQ-Former uses metadata as text guidance to extract robust visual tokens from cluttered product images for e-commerce retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for learning visual representations that remain effective even when real-world product photos contain promotional overlays and background clutter. It treats structured item metadata as semantic guidance that steers a connector to pull out relevant visual tokens while still keeping complementary image details. The approach separates the visual stream into metadata-anchored and exploratory parts, then uses a modulation module to adjust their balance when noise is present. Experiments on large-scale e-commerce datasets with full-pool retrieval show consistent gains over existing connector methods and full multimodal language models. A reader would care because better handling of noisy images directly improves item-to-item recommendation quality in practical retail settings.

Core claim

The central claim is that a Text-Guided Q-Former (TGQ-Former) with a hybrid-query connector disentangles metadata-anchored and exploratory visual streams, while a reliability-aware dual-gated vector modulation module adaptively calibrates their contributions under noisy inputs, thereby producing more robust multimodal item embeddings than prior connector baselines or end-to-end multimodal language models.

What carries the argument

Hybrid-query connector that disentangles metadata-anchored and exploratory visual streams, paired with a lightweight reliability-aware dual-gated vector modulation module for adaptive calibration.
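The abstract names these two components but gives no equations, so the following is only a minimal sketch of one way such a design could be wired up, not the authors' implementation. It assumes single-head cross-attention without learned projections, mean pooling over queries, and two sigmoid gates computed from the concatenated streams; `hybrid_query_embed`, `w_anchor`, and `w_explore` are hypothetical names introduced for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(queries, tokens):
    # Single-head scaled dot-product cross-attention (assumption: no learned projections).
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens

def hybrid_query_embed(visual_tokens, meta_queries, explore_queries, w_anchor, w_explore):
    # Stream 1: queries derived from metadata text attend to the image tokens.
    anchored = cross_attend(meta_queries, visual_tokens).mean(axis=0)
    # Stream 2: free (learnable) queries attend to the same image tokens.
    exploratory = cross_attend(explore_queries, visual_tokens).mean(axis=0)
    # Hypothetical dual gates: one per-dimension gate for each stream.
    joint = np.concatenate([anchored, exploratory])
    g_anchor = sigmoid(w_anchor @ joint)
    g_explore = sigmoid(w_explore @ joint)
    return g_anchor * anchored + g_explore * exploratory

rng = np.random.default_rng(0)
d = 16
visual_tokens = rng.normal(size=(49, d))    # e.g. a 7x7 patch grid from a frozen encoder
meta_queries = rng.normal(size=(4, d))      # stand-in for encoded metadata fields
explore_queries = rng.normal(size=(8, d))   # stand-in for learned exploratory queries
w_anchor = 0.1 * rng.normal(size=(d, 2 * d))
w_explore = 0.1 * rng.normal(size=(d, 2 * d))
item_embedding = hybrid_query_embed(
    visual_tokens, meta_queries, explore_queries, w_anchor, w_explore
)
```

Under this reading, a gate value near zero on the anchored stream would correspond to the module distrusting the metadata for that dimension; whether the paper's gates behave this way is not stated in the abstract.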

If this is right

TGQ-Former improves Hit Rate@100 by 6.04% on average over strong connector baselines and end-to-end MLLMs on large-scale real-world e-commerce datasets.
The method maintains retrieval gains while handling spurious visual cues from promotional overlays and background clutter.
It preserves complementary visual evidence that would otherwise be lost when relying solely on metadata or frozen vision encoders.
The framework works with full-pool retrieval, showing effectiveness beyond limited candidate sets.
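For readers unfamiliar with the metric, here is a minimal sketch of Hit Rate@K under full-pool retrieval. It assumes cosine similarity between embeddings and counts a query as a hit when at least one of its positives appears in the top K of the entire pool; the paper's exact protocol is not given in the abstract, and `hit_rate_at_k` is a name chosen here for illustration.

```python
import numpy as np

def hit_rate_at_k(query_emb, pool_emb, positives, k=100):
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    scores = q @ p.T                      # every query scored against the full pool
    k = min(k, scores.shape[1])
    # Indices of the k highest-scoring pool items per query (order within top-k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & pos) for row, pos in zip(top_k, positives)]
    return float(np.mean(hits))

# Toy pool of four orthogonal items; queries are copies of items 0 and 1.
pool = np.eye(4)
queries = pool[[0, 1]]
toy_rate = hit_rate_at_k(queries, pool, positives=[{0}, {3}], k=1)
```

In the toy case the first query retrieves its positive and the second does not, so `toy_rate` is 0.5. Scoring against the full pool rather than a sampled candidate set is what makes the metric sensitive to every distractor item.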

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same text-guidance principle could be tested on other noisy-image domains such as user-generated photos in social recommendation.
Replacing the current modulation module with learned reliability predictors might further reduce sensitivity to metadata quality.
The hybrid-query design suggests a general pattern for any connector that must balance guided and open-ended feature extraction.

Load-bearing premise

Structured metadata is accurate and sufficient to serve as reliable semantic guidance that lets the connector separate visual streams without discarding useful image evidence.

What would settle it

Running the same full-pool retrieval experiments on the same datasets but with deliberately corrupted or missing metadata fields and checking whether the reported Hit Rate@100 gains over baselines disappear or reverse.
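A minimal sketch of how such a corruption step might be set up follows. It assumes item metadata is stored as a dictionary of fields per item; the field names and the `corrupt_metadata` helper are hypothetical, and the paper does not describe any such procedure.

```python
import random

def corrupt_metadata(items, drop_prob=0.0, swap_prob=0.0, seed=0):
    # Returns corrupted copies; the input list is left untouched.
    rng = random.Random(seed)
    corrupted = [dict(item) for item in items]
    fields = sorted({key for item in items for key in item})
    for field in fields:
        for item in corrupted:
            if field not in item:
                continue
            r = rng.random()
            if r < drop_prob:
                # Simulate a missing field.
                del item[field]
            elif r < drop_prob + swap_prob:
                # Simulate an erroneous field by copying the value from a random item.
                donor = rng.choice(items)
                if field in donor:
                    item[field] = donor[field]
    return corrupted

catalog = [
    {"category": "shoes", "brand": "A"},   # hypothetical metadata fields
    {"category": "bags", "brand": "B"},
]
noisy_catalog = corrupt_metadata(catalog, drop_prob=0.2, swap_prob=0.2, seed=42)
```

Sweeping `drop_prob` and `swap_prob` and re-running retrieval at each setting would trace out how quickly any advantage over the baselines erodes as metadata quality falls.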

Figures

Figures reproduced from arXiv: 2605.17366 by Jing Ma, Jungong Han, Pinghua Gong, Shijie Yang, Tianlu Zhang, Weijie Ding, Yanlong Zang, Yufei Guo.

**Figure 1.** Illustration of different strategies for using LLMs [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

**Figure 3.** Architecture of the proposed Text-Guided Visual Representation Learning framework. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** Cross-attention visualization of different query [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Qualitative examples from poster-style listings with [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

Abstract

Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate their contributions under noisy inputs. Experiments on large-scale, real-world e-commerce datasets with full-pool retrieval show that TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs. On average, it improves Hit Rate@100 (H@100) by 6.04%, demonstrating the effectiveness of text-guided visual encoding for robust multimodal retrieval.
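The abstract does not say whether the 6.04% average is a relative lift or a gain in absolute percentage points, and the two readings can differ substantially. The short example below illustrates the gap with placeholder numbers that are not taken from the paper; the function names are chosen here for illustration.

```python
def relative_lift(baseline, method):
    # Percent improvement relative to the baseline score.
    return 100.0 * (method - baseline) / baseline

def absolute_lift(baseline, method):
    # Improvement in percentage points.
    return 100.0 * (method - baseline)

# Placeholder H@100 values for two hypothetical datasets (NOT from the paper).
baseline_h100 = [0.40, 0.50]
method_h100 = [0.44, 0.52]

rel = [relative_lift(b, m) for b, m in zip(baseline_h100, method_h100)]
pts = [absolute_lift(b, m) for b, m in zip(baseline_h100, method_h100)]
mean_rel = sum(rel) / len(rel)
mean_abs = sum(pts) / len(pts)
```

With these placeholder values the same results average to roughly 7% relative but only 3 percentage points absolute, which is why the convention should be stated alongside the headline number.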

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TGQ-Former adds a hybrid-query connector and gated modulation to Q-Former for metadata-guided visual features in e-commerce retrieval, but the 6% hit-rate claim rests on thin experimental detail.

The paper's main contribution is TGQ-Former, which routes structured metadata through a hybrid-query connector to separate metadata-anchored visual tokens from exploratory ones, then uses a reliability-aware dual-gated module to adjust their mix when inputs are noisy. It reports a 6.04% average H@100 lift over connector baselines and end-to-end MLLMs on large-scale e-commerce data with full-pool retrieval. That is the concrete extension on top of existing Q-Former work.

The framing is direct: product images carry promotional clutter that hurts retrieval, and metadata can serve as semantic guidance without discarding useful visual evidence. The reliability module is a reasonable attempt to make the approach robust rather than brittle. These pieces address a genuine deployment issue in MLRM-style pipelines.

The soft spots are in the supporting evidence. The abstract states consistent outperformance but supplies no dataset sizes, ablation breakdowns, baseline reimplementation details, or statistical significance tests. Without those, it is difficult to attribute the gains specifically to the new connector and modulation rather than training differences or data quirks. The central assumption that metadata is accurate enough to drive reliable disentanglement also lacks any reported sensitivity check; if metadata is incomplete or erroneous, the hybrid streams could mix in the wrong signals. This is a practical gap rather than a fatal one.

The work is aimed at applied teams building multimodal item-to-item retrieval for retail platforms. Readers who need incremental, domain-targeted fixes to visual token aggregation will find usable architecture ideas here. It shows clear engagement with the problem and the literature on connectors, so it deserves a serious referee to verify the experiments and test the metadata robustness claim. I would send it out for review with requests for ablations and error analysis rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework for robust multimodal item-to-item retrieval in e-commerce. It employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams from product images, augmented by a reliability-aware dual-gated vector modulation module to calibrate contributions under noisy inputs such as promotional overlays. Experiments on large-scale real-world e-commerce datasets with full-pool retrieval report consistent outperformance over strong connector baselines and end-to-end MLLMs, with an average 6.04% improvement in Hit Rate@100.

Significance. If the results hold under rigorous validation, the approach offers a practical advance for handling visual noise in multimodal e-commerce retrieval by leveraging structured metadata for guided token selection while preserving complementary visual evidence. The hybrid-query design and modulation module provide a targeted architectural response to a recurring deployment challenge, with potential for improved recommendation robustness.

major comments (2)

[Experiments] Experiments section: the central claim of a 6.04% average H@100 lift is presented without reported dataset sizes, statistical significance tests (e.g., paired t-tests or bootstrap intervals), ablation controls isolating the hybrid-query connector, or exact baseline re-implementations, leaving the empirical support for outperformance only partially substantiated.
[Method] Method description of the hybrid-query connector and reliability-aware dual-gated vector modulation module: the disentanglement of metadata-anchored and exploratory streams is asserted to succeed without discarding useful visual evidence, yet no quantitative sensitivity analysis to metadata perturbations, error rates, or completeness is supplied; this assumption is load-bearing for the robustness claims given the module's role in handling noisy inputs.
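As an illustration of the bootstrap intervals mentioned in the first major comment, here is a minimal sketch of a paired bootstrap over queries. It assumes per-query hit indicators (0 or 1) are available for both the baseline and the proposed method on the same query set; `paired_bootstrap_ci` is a hypothetical helper, not something the paper provides.

```python
import numpy as np

def paired_bootstrap_ci(hits_a, hits_b, n_resamples=2000, alpha=0.05, seed=0):
    # hits_a / hits_b: per-query hit indicators for baseline and method, same query order.
    hits_a = np.asarray(hits_a, dtype=float)
    hits_b = np.asarray(hits_b, dtype=float)
    if hits_a.shape != hits_b.shape:
        raise ValueError("both systems must be scored on the same queries")
    diff = hits_b - hits_a                      # paired per-query differences
    rng = np.random.default_rng(seed)
    # Resample queries with replacement, keeping each pair together.
    idx = rng.integers(0, len(diff), size=(n_resamples, len(diff)))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)

# Simulated indicators for demonstration only (NOT results from the paper).
rng = np.random.default_rng(1)
baseline_hits = rng.integers(0, 2, size=500)
method_hits = np.maximum(baseline_hits, rng.integers(0, 2, size=500))
observed, ci_low, ci_high = paired_bootstrap_ci(baseline_hits, method_hits)
```

If the resulting interval excludes zero, the H@100 difference is unlikely to be an artifact of which queries happened to be sampled. Pairing matters because both systems are scored on the same queries, so per-query difficulty cancels out in the difference.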

minor comments (2)

[Abstract] Abstract: the phrase 'large-scale, real-world e-commerce datasets' would benefit from a parenthetical note on approximate item counts or domain characteristics to orient readers.
[Method] Notation for the dual-gated modulation: an explicit equation defining the gating functions and their interaction with the two visual streams would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will incorporate revisions to strengthen the empirical and methodological sections.

Referee: [Experiments] Experiments section: the central claim of a 6.04% average H@100 lift is presented without reported dataset sizes, statistical significance tests (e.g., paired t-tests or bootstrap intervals), ablation controls isolating the hybrid-query connector, or exact baseline re-implementations, leaving the empirical support for outperformance only partially substantiated.

Authors: We agree that these details are necessary for rigorous validation. In the revised manuscript we will report the exact sizes of the large-scale e-commerce datasets (number of items, queries, and retrieval pool), include statistical significance tests such as paired t-tests or bootstrap intervals on the H@100 improvements, add ablation studies that isolate the hybrid-query connector, and specify the precise re-implementations and hyper-parameters of all baselines. These additions will provide fuller substantiation for the reported gains.
Revision: yes

Referee: [Method] Method description of the hybrid-query connector and reliability-aware dual-gated vector modulation module: the disentanglement of metadata-anchored and exploratory streams is asserted to succeed without discarding useful visual evidence, yet no quantitative sensitivity analysis to metadata perturbations, error rates, or completeness is supplied; this assumption is load-bearing for the robustness claims given the module's role in handling noisy inputs.

Authors: We acknowledge that a quantitative sensitivity analysis would strengthen the robustness claims. We will add such an analysis to the revised manuscript, systematically introducing controlled perturbations to metadata (varying noise levels, error rates, and completeness) and measuring the resulting impact on token selection and retrieval performance. This will empirically demonstrate that the hybrid-query design preserves complementary visual evidence under imperfect metadata conditions.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of new architecture on external datasets

The paper introduces TGQ-Former as a new text-guided visual representation framework with a hybrid-query connector and reliability-aware dual-gated modulation module. Performance claims (e.g., 6.04% H@100 improvement) are presented as results of experiments on large-scale real-world e-commerce datasets with full-pool retrieval, compared against baselines and end-to-end MLLMs. No equations, fitted parameters, or self-citations are shown to reduce the central claims to inputs by construction. The derivation chain consists of architectural design choices justified by the problem of spurious visual cues in product images, followed by empirical validation. This is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or self-citation load-bearing patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the empirical effectiveness of newly introduced architectural modules whose value is demonstrated through experiments rather than derived from first principles. No explicit numerical free parameters are stated in the abstract.

axioms (1)

domain assumption: Structured metadata provides reliable semantic guidance for visual token extraction.
Invoked when the framework is described as leveraging metadata to disentangle visual streams.

invented entities (3)

Text-Guided Q-Former (TGQ-Former) · no independent evidence
purpose: Framework for text-guided visual representation learning in noisy e-commerce images.
Newly proposed connector architecture.

Hybrid-query connector · no independent evidence
purpose: Disentangles metadata-anchored and exploratory visual streams.
Core component of the proposed method.

Reliability-aware dual-gated vector modulation module · no independent evidence
purpose: Adaptively calibrates contributions of visual streams under noisy inputs.
Lightweight module introduced to handle noise.

pith-pipeline@v0.9.0 · 5749 in / 1471 out tokens · 38267 ms · 2026-05-19T23:04:45.034157+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al.

[2] Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022), 23716–23736.

[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xiong-Hui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Rongyao Fang, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025).

[5] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval...

[6] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX (Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 104–120. doi:10....

[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys ’16). Association for Computing Machinery, New York, NY, USA, 191–198. doi:10.1145/2959100.2959190

[8] Nilotpal Das, Aniket Joshi, Promod Yenigalla, and Gourav Agrwal. 2022. MAPS: Multimodal Attention for Product Similarity. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 3551–3560.

[9] Xiuqi Deng, Lu Xu, Xiyao Li, Jinkai Yu, Erpeng Xue, Zhongyuan Wang, Di Zhang, Zhaojie Liu, Guorui Zhou, Yang Song, et al. 2024. End-to-End Training of Multimodal Model and Ranking Model. arXiv preprint arXiv:2404.06078 (2024).

[10] Karan Desai and Justin Johnson. 2021. VirTex: Learning Visual Representations from Textual Annotations. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 11157–11168. doi:10.1109/CVPR46437.2021.01101

[11] Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2023. FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 2669–2680.

[12] Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2024. Large Language Models are Zero-Shot Rankers for Recommender Systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II (Glasgow, United Kingdom). Springe...

[13] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. 2022. Perceiver IO: A General Architecture for Structured Inputs & Outputs. In International Conference on Learning Representations (ICLR), Vol. abs/2107.14795.

[14] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. 2021. Perceiver: General Perception with Iterative Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 4651–4664. http://proceedi...

[15] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Ma...

[16] Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2023. Bootstrapping vision-language learning with decoupled language pre-training. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 4, 16 pages.

[17] Yang Jin, Yongzhi Li, Zehuan Yuan, and Yadong Mu. 2023. Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 11060–11069.

[18] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR abs/1702.08734 (2017). http://dblp.uni-trier.de/db/journals/corr/corr1702.html#JohnsonDJ17

[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.

[20] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao

[21] Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX (Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 121–137. doi:10.1007/978-3-030-58577-8_8

[22] Zihan Liang, Yufei Ma, Zhipeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, and Han Li. 2025. UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (Seoul, Republic of Korea) (CIKM ’25). Association for Comput...

[23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26296–26306.

[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34892–34916. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6...

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July ...

[26] Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. 2021. Quality metrics in recommender systems: Do we calculate metrics consistently?. In Proceedings of the 15th ACM conference on recommender systems. 708–713.

[27] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024).

[28] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 1437–1445. doi:10.1145/3343...

[29] Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. 2024. A survey on large language models for recommendation. World Wide Web 27, 5 (Aug. 2024), 31 pages. doi:10.1007/s11280-024-01291-2

[30] Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Wei Wang, Xiping Hu, Steven Hoi, and Edith C. H. Ngai. 2025. A Survey on Multimodal Recommender Systems: Recent Advances and Future Directions. ArXiv abs/2502.15711 (2025).

[31] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).

[32] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. 2022. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335 (2022).

[33] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).

[34] Xiaoyong Yang, Yadong Zhu, Yi Zhang, Xiaobo Wang, and Quan Yuan. 2020. Large scale product graph construction for recommendation in e-commerce. arXiv preprint arXiv:2010.05525 (2020).

[35] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A New Foundation Model for Computer Vision. ArXiva...

[36] Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, and Enhong Chen. 2025. Notellm-2: Multimodal large representation models for recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1. 2815–2826.

[37] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Choi Yejin, and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5579–5588.

[38] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025).

[39] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025).

A Model Details

We present the details of the three LLMs employed in our method in Table 5, ...

[1] [1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

work page

[2] [2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

work page 2022

[3] [3]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xiong-Hui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Rongyao Fang, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval...

work page doi:10.1145/3331184.3331254 2019

[6] [6]

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. InComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX(Glasgow, United Kingdom). Springer- Verlag, Berlin, Heidelberg, 104–120. doi:10....

work page doi:10.1007/978-3-030-58577-8_7 2020

[7] [7]

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. InProceedings of the 10th ACM Conference on Recommender Systems(Boston, Massachusetts, USA)(RecSys ’16). Association for Computing Machinery, New York, NY, USA, 191–198. doi:10.1145/2959100. 2959190

work page doi:10.1145/2959100 2016

[8] [8]

Nilotpal Das, Aniket Joshi, Promod Yenigalla, and Gourav Agrwal. 2022. MAPS: Multimodal Attention for Product Similarity. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). 3551–3560

work page 2022

[9] [9]

Xiuqi Deng, Lu Xu, Xiyao Li, Jinkai Yu, Erpeng Xue, Zhongyuan Wang, Di Zhang, Zhaojie Liu, Guorui Zhou, Yang Song, et al. 2024. End-to-End Training of Multimodal Model and Ranking Model.arXiv preprint arXiv:2404.06078(2024)

work page arXiv 2024

[10]

Karan Desai and Justin Johnson. 2021. VirTex: Learning Visual Representations from Textual Annotations. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 11157–11168. doi:10.1109/CVPR46437.2021.01101

[11]

Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2023. FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 2669–2680

[12]

Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2024. Large Language Models are Zero-Shot Rankers for Recommender Systems. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II (Glasgow, United Kingdom). Springe...

doi:10.1007/978-3-031-56060-6_24

[13]

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. 2022. Perceiver IO: A General Architecture for Structured Inputs & Outputs. In International Conference on Learning Representations (ICLR), Vol. abs/2107.14795

[14]

Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. 2021. Perceiver: General Perception with Iterative Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 4651–4664. http://proceedi...

[15]

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Ma...

[16]

Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2023. Bootstrapping vision-language learning with decoupled language pre-training. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, Article 4, 16 pages

[17]

Yang Jin, Yongzhi Li, Zehuan Yuan, and Yadong Mu. 2023. Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), 11060–11069

[18]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR abs/1702.08734 (2017). http://dblp.uni-trier.de/db/journals/corr/corr1702.html#JohnsonDJ17

[19]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742

[20]

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX (Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 121–137. doi:10.1007/978-3-030-58577-8_8

[22]

Zihan Liang, Yufei Ma, Zhipeng Qian, Huangyu Dai, Zihan Wang, Ben Chen, Chenyi Lei, Yuqing Ding, and Han Li. 2025. UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (Seoul, Republic of Korea) (CIKM ’25). Association for Comput...

doi:10.1145/3746252.3761170

[23]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 26296–26306

[24]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34892–34916. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6...

[25]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July ...

[26]

Yan-Martin Tamm, Rinchin Damdinov, and Alexey Vasilev. 2021. Quality metrics in recommender systems: Do we calculate metrics consistently? In Proceedings of the 15th ACM conference on recommender systems. 708–713

[27]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

[28]

Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). Association for Computing Machinery, New York, NY, USA, 1437–1445. doi:10.1145/3343...

doi:10.1145/3343031.3351034

[29]

Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, Hui Xiong, and Enhong Chen. 2024. A survey on large language models for recommendation. World Wide Web 27, 5 (Aug. 2024), 31 pages. doi:10.1007/s11280-024-01291-2

[30]

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Wei Wang, Xiping Hu, Steven Hoi, and Edith C. H. Ngai. 2025. A Survey on Multimodal Recommender Systems: Recent Advances and Future Directions. ArXiv abs/2502.15711 (2025)

[31]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[32]

An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. 2022. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335 (2022)

[33]

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023)

[34]

Xiaoyong Yang, Yadong Zhu, Yi Zhang, Xiaobo Wang, and Quan Yuan. 2020. Large scale product graph construction for recommendation in e-commerce. arXiv preprint arXiv:2010.05525 (2020)

[35]

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A New Foundation Model for Computer Vision. ArXiva...

[36]

Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, and Enhong Chen. 2025. Notellm-2: Multimodal large representation models for recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1. 2815–2826

[37]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Choi Yejin, and Jianfeng Gao. 2021. VinVL: Revisiting Visual Representations in Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5579–5588

[38]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025)

[39]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 2025. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025).

A Model Details

We present the details of the three LLMs employed in our method in Table 5, ...