arxiv: 2605.17366 · v1 · pith:Z4X4O7C5new · submitted 2026-05-17 · 💻 cs.IR

Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

Pith reviewed 2026-05-19 23:04 UTC · model grok-4.3

classification 💻 cs.IR
keywords multimodal recommendation · visual representation learning · e-commerce retrieval · text-guided learning · Q-Former · item-to-item recommendation · robust embeddings · noisy images

The pith

TGQ-Former uses metadata as text guidance to extract robust visual tokens from cluttered product images for e-commerce retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for learning visual representations that remain effective even when real-world product photos contain promotional overlays and background clutter. It treats structured item metadata as semantic guidance that steers a connector to pull out relevant visual tokens while still keeping complementary image details. The approach separates the visual stream into metadata-anchored and exploratory parts, then uses a modulation module to adjust their balance when noise is present. Experiments on large-scale e-commerce datasets with full-pool retrieval show consistent gains over existing connector methods and full multimodal language models. A reader would care because better handling of noisy images directly improves item-to-item recommendation quality in practical retail settings.

Core claim

The central claim is that a Text-Guided Q-Former (TGQ-Former) with a hybrid-query connector disentangles metadata-anchored and exploratory visual streams, while a reliability-aware dual-gated vector modulation module adaptively calibrates their contributions under noisy inputs, thereby producing more robust multimodal item embeddings than prior connector baselines or end-to-end multimodal language models.

What carries the argument

Hybrid-query connector that disentangles metadata-anchored and exploratory visual streams, paired with a lightweight reliability-aware dual-gated vector modulation module for adaptive calibration.
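
The source gives no equations for the modulation module, so the sketch below is only one plausible reading of "dual-gated vector modulation" and should not be taken as the paper's formulation. It assumes each stream has already been pooled into a vector, and that each stream gets its own per-dimension sigmoid gate. The function names and the parameters `w_a`, `w_e`, `b_a`, `b_e` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(vec, weights, bias):
    # One gate value per dimension, squashed into (0, 1).
    return [sigmoid(w * x + bias) for w, x in zip(weights, vec)]

def dual_gated_fusion(anchored, exploratory, w_a, w_e, b_a=0.0, b_e=0.0):
    """Fuse two stream vectors with independent per-dimension gates.

    Assumption (not from the paper): gates depend only on their own stream,
    and the fused embedding is the gated sum of both streams.
    """
    g_a = gate(anchored, w_a, b_a)
    g_e = gate(exploratory, w_e, b_e)
    return [ga * a + ge * e
            for ga, a, ge, e in zip(g_a, anchored, g_e, exploratory)]

anchored = [1.0, 2.0]      # metadata-anchored stream (toy values)
exploratory = [3.0, -2.0]  # exploratory stream (toy values)
zeros = [0.0, 0.0]

# With zero weights and biases both gates sit at 0.5.
balanced = dual_gated_fusion(anchored, exploratory, zeros, zeros)

# A strongly negative bias closes the exploratory gate almost completely.
suppressed = dual_gated_fusion(anchored, exploratory, zeros, zeros, b_e=-50.0)
```

Under this reading, a gate driven toward zero is how a stream judged unreliable would be suppressed while the other stream still contributes.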

If this is right

  • TGQ-Former improves Hit Rate@100 by 6.04% on average over strong connector baselines and end-to-end MLLMs on large-scale real-world e-commerce datasets.
  • The method maintains retrieval gains while handling spurious visual cues from promotional overlays and background clutter.
  • It preserves complementary visual evidence that would otherwise be lost when relying solely on metadata or frozen vision encoders.
  • The framework works with full-pool retrieval, showing effectiveness beyond limited candidate sets.
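
For readers unfamiliar with the metric, the following is a minimal sketch of Hit Rate@K under full-pool retrieval. It assumes cosine similarity as the scoring function and counts a query as a hit when at least one relevant item appears in its top K; the paper's exact protocol (for example, whether the query item is excluded from the pool) is not stated in the source, and all names here are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hit_rate_at_k(query_embs, pool_embs, positives, k):
    """positives[i] is the set of pool indices relevant to query i."""
    hits = 0
    for q, pos in zip(query_embs, positives):
        # Full-pool retrieval: score the query against every item in the pool.
        scores = [cosine(q, p) for p in pool_embs]
        top_k = sorted(range(len(pool_embs)),
                       key=lambda j: scores[j], reverse=True)[:k]
        if pos.intersection(top_k):
            hits += 1
    return hits / len(query_embs)

pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
queries = [[1.0, 0.05], [0.1, 1.0]]
positives = [{1}, {3}]

print(hit_rate_at_k(queries, pool, positives, 1))  # 0.0
print(hit_rate_at_k(queries, pool, positives, 2))  # 0.5
print(hit_rate_at_k(queries, pool, positives, 4))  # 1.0
```

At production scale the exhaustive scoring loop would normally be replaced by an approximate nearest-neighbour index; the brute-force version is kept here only to make the definition explicit.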

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same text-guidance principle could be tested on other noisy-image domains such as user-generated photos in social recommendation.
  • Replacing the current modulation module with learned reliability predictors might further reduce sensitivity to metadata quality.
  • The hybrid-query design suggests a general pattern for any connector that must balance guided and open-ended feature extraction.

Load-bearing premise

Structured metadata is accurate and sufficient to serve as reliable semantic guidance that lets the connector separate visual streams without discarding useful image evidence.

What would settle it

Running the same full-pool retrieval experiments on the same datasets but with deliberately corrupted or missing metadata fields and checking whether the reported Hit Rate@100 gains over baselines disappear or reverse.
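
The corruption step of such a test could be sketched as below. This is an illustration only: the field names, the two corruption modes (dropping a field, or swapping in the same field from another item), and the function name are assumptions, not details from the paper.

```python
import random

def corrupt_metadata(item, donor_items, drop_prob, swap_prob, rng):
    """Return a copy of `item` with fields randomly dropped or swapped.

    drop_prob: chance that a field goes missing.
    swap_prob: chance that a field is replaced by a donor item's value.
    """
    corrupted = {}
    for field, value in item.items():
        roll = rng.random()
        if roll < drop_prob:
            continue  # simulate a missing field
        if roll < drop_prob + swap_prob:
            donor = rng.choice(donor_items)
            corrupted[field] = donor.get(field, value)  # simulate a wrong value
        else:
            corrupted[field] = value
    return corrupted

rng = random.Random(0)
item = {"title": "red running shoe", "category": "footwear"}   # hypothetical fields
donors = [{"title": "blue sun hat", "category": "headwear"}]

noisy = corrupt_metadata(item, donors, drop_prob=0.2, swap_prob=0.2, rng=rng)
```

Sweeping `drop_prob` and `swap_prob` from zero upward, and recomputing retrieval metrics for every method at each level, would show whether the gap to the baselines shrinks as the metadata degrades.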

Figures

Figures reproduced from arXiv: 2605.17366 by Jing Ma, Jungong Han, Pinghua Gong, Shijie Yang, Tianlu Zhang, Weijie Ding, Yanlong Zang, Yufei Guo.

Figure 1: Illustration of different strategies for using LLMs …
Figure 2: Illustration of our proposed text-guided visual rep…
Figure 3: Architecture of the proposed Text-Guided Visual Representation Learning framework.
Figure 4: Cross-attention visualization of different query …
Figure 5: Qualitative examples from poster-style listings with …

Original abstract

Multimodal item embeddings are crucial for e-commerce item-to-item (I2I) retrieval, yet real-world product images often contain promotional overlays and background clutter that inject spurious visual cues and degrade retrieval robustness. This issue is particularly pronounced in MLRM-style pipelines, where a frozen vision encoder is connected to an LLM through a lightweight connector that must selectively aggregate visual tokens. We propose Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework that leverages structured metadata as semantic guidance for visual token extraction while preserving complementary visual evidence. Concretely, TGQ-Former employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams, and introduces a lightweight reliability-aware dual-gated vector modulation module to adaptively calibrate their contributions under noisy inputs. Experiments on large-scale, real-world e-commerce datasets with full-pool retrieval show that TGQ-Former consistently outperforms strong connector baselines and end-to-end MLLMs. On average, it improves Hit Rate@100 (H@100) by 6.04%, demonstrating the effectiveness of text-guided visual encoding for robust multimodal retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Text-Guided Q-Former (TGQ-Former), a text-guided visual representation learning framework for robust multimodal item-to-item retrieval in e-commerce. It employs a hybrid-query connector to disentangle metadata-anchored and exploratory visual streams from product images, augmented by a reliability-aware dual-gated vector modulation module to calibrate contributions under noisy inputs such as promotional overlays. Experiments on large-scale real-world e-commerce datasets with full-pool retrieval report consistent outperformance over strong connector baselines and end-to-end MLLMs, with an average 6.04% improvement in Hit Rate@100.

Significance. If the results hold under rigorous validation, the approach offers a practical advance for handling visual noise in multimodal e-commerce retrieval by leveraging structured metadata for guided token selection while preserving complementary visual evidence. The hybrid-query design and modulation module provide a targeted architectural response to a recurring deployment challenge, with potential for improved recommendation robustness.

major comments (2)
  1. [Experiments] Experiments section: the central claim of a 6.04% average H@100 lift is presented without reported dataset sizes, statistical significance tests (e.g., paired t-tests or bootstrap intervals), ablation controls isolating the hybrid-query connector, or exact baseline re-implementations, leaving the empirical support for outperformance only partially substantiated.
  2. [Method] Method description of the hybrid-query connector and reliability-aware dual-gated vector modulation module: the disentanglement of metadata-anchored and exploratory streams is asserted to succeed without discarding useful visual evidence, yet no quantitative sensitivity analysis to metadata perturbations, error rates, or completeness is supplied; this assumption is load-bearing for the robustness claims given the module's role in handling noisy inputs.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'large-scale, real-world e-commerce datasets' would benefit from a parenthetical note on approximate item counts or domain characteristics to orient readers.
  2. [Method] Notation for the dual-gated modulation: an explicit equation defining the gating functions and their interaction with the two visual streams would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will incorporate revisions to strengthen the empirical and methodological sections.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of a 6.04% average H@100 lift is presented without reported dataset sizes, statistical significance tests (e.g., paired t-tests or bootstrap intervals), ablation controls isolating the hybrid-query connector, or exact baseline re-implementations, leaving the empirical support for outperformance only partially substantiated.

    Authors: We agree that these details are necessary for rigorous validation. In the revised manuscript we will report the exact sizes of the large-scale e-commerce datasets (number of items, queries, and retrieval pool), include statistical significance tests such as paired t-tests or bootstrap intervals on the H@100 improvements, add ablation studies that isolate the hybrid-query connector, and specify the precise re-implementations and hyper-parameters of all baselines. These additions will provide fuller substantiation for the reported gains.

    Revision: yes

  2. Referee: [Method] Method description of the hybrid-query connector and reliability-aware dual-gated vector modulation module: the disentanglement of metadata-anchored and exploratory streams is asserted to succeed without discarding useful visual evidence, yet no quantitative sensitivity analysis to metadata perturbations, error rates, or completeness is supplied; this assumption is load-bearing for the robustness claims given the module's role in handling noisy inputs.

    Authors: We acknowledge that a quantitative sensitivity analysis would strengthen the robustness claims. We will add such an analysis to the revised manuscript, systematically introducing controlled perturbations to metadata (varying noise levels, error rates, and completeness) and measuring the resulting impact on token selection and retrieval performance. This will empirically demonstrate that the hybrid-query design preserves complementary visual evidence under imperfect metadata conditions.

    Revision: yes
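
The bootstrap intervals mentioned in the first exchange could be computed along the following lines. This is a sketch of a standard paired percentile bootstrap over queries, assuming per-query hit indicators (1 or 0) are available for both systems on the same query set; it is not the authors' procedure, and the function name and defaults are hypothetical.

```python
import random

def paired_bootstrap_ci(hits_a, hits_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (A minus B)."""
    assert len(hits_a) == len(hits_b)
    rng = random.Random(seed)
    n = len(hits_a)
    # Pairing: each query contributes one difference, resampled as a unit.
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi

# Toy per-query hit indicators for two systems on the same ten queries.
system_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
system_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
```

An interval that excludes zero would indicate that the observed H@100 gain is unlikely to be an artifact of which queries happened to be sampled.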

Circularity Check

0 steps flagged

No circularity: empirical evaluation of new architecture on external datasets

The paper introduces TGQ-Former as a new text-guided visual representation framework with a hybrid-query connector and reliability-aware dual-gated modulation module. Performance claims (e.g., 6.04% H@100 improvement) are presented as results of experiments on large-scale real-world e-commerce datasets with full-pool retrieval, compared against baselines and end-to-end MLLMs. No equations, fitted parameters, or self-citations are shown to reduce the central claims to inputs by construction. The derivation chain consists of architectural design choices justified by the problem of spurious visual cues in product images, followed by empirical validation. This is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or self-citation load-bearing patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the empirical effectiveness of newly introduced architectural modules whose value is demonstrated through experiments rather than derived from first principles. No explicit numerical free parameters are stated in the abstract.

axioms (1)
  • [domain assumption] Structured metadata provides reliable semantic guidance for visual token extraction.
    Invoked when the framework is described as leveraging metadata to disentangle visual streams.
invented entities (3)
  • Text-Guided Q-Former (TGQ-Former) [no independent evidence]
    purpose: Framework for text-guided visual representation learning in noisy e-commerce images.
    Newly proposed connector architecture.
  • Hybrid-query connector [no independent evidence]
    purpose: Disentangles metadata-anchored and exploratory visual streams.
    Core component of the proposed method.
  • Reliability-aware dual-gated vector modulation module [no independent evidence]
    purpose: Adaptively calibrates contributions of visual streams under noisy inputs.
    Lightweight module introduced to handle noise.
