pith. machine review for the scientific record.

arxiv: 2601.04720 · v2 · submitted 2026-01-08 · 💻 cs.CL

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Pith reviewed 2026-05-12 09:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal retrieval, embedding models, reranking, contrastive pre-training, vision-language models, cross-encoder, unified representation space, Matryoshka learning

The pith

Qwen3-VL-Embedding-8B reaches 77.8 on the MMEB-V2 benchmark and leads all multimodal retrieval models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker as extensions of the Qwen3-VL model that together form a unified pipeline for retrieval and ranking across text, images, document images, and video. The embedding model is trained in multiple stages, starting with large-scale contrastive pre-training and then distilling from a reranker; it also supports flexible embedding sizes through Matryoshka Representation Learning and accepts inputs of up to 32k tokens. The reranker is a cross-encoder that uses cross-attention to produce fine-grained relevance scores for query-document pairs. Both models retain support for more than 30 languages and are released in 2B and 8B sizes. A sympathetic reader would care because accurate multimodal search lets people find relevant content when queries and results mix different media types.

Core claim

The Qwen3-VL-Embedding model, built through contrastive pre-training followed by reranking distillation on the Qwen3-VL foundation, maps text, images, documents, and video into a single high-dimensional representation space that supports variable dimensions, while the Qwen3-VL-Reranker uses cross-attention to perform precise relevance estimation on query-document pairs; together they deliver state-of-the-art results on multimodal benchmarks, including an overall score of 77.8 on MMEB-V2 for the 8B embedding model.

What carries the argument

The multi-stage training paradigm that moves from large-scale contrastive pre-training to reranking model distillation, combined with Matryoshka Representation Learning, to produce unified embeddings across modalities.
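
To make the contrastive stage concrete, here is a minimal sketch of an in-batch InfoNCE loss, a common objective for this kind of pre-training. This page does not state the paper's exact loss, so the temperature value, the use of in-batch negatives, and the function names are assumptions for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is the positive for queries[i]; every other document
    in the batch is treated as a negative (an assumption of this sketch).
    """
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)

# Toy 2-d vectors: in `aligned` each query is close to its own document,
# in `swapped` each query is close to the other query's document.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each query scores its own document above the rest of the batch and grows when the pairing is wrong, which is the pressure that pulls matching items from different modalities toward each other in a shared space.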

If this is right

  • The 8B embedding model attains an overall score of 77.8 on MMEB-V2 and ranks first among all models evaluated as of January 8, 2025.
  • The models achieve state-of-the-art results across diverse multimodal embedding evaluation benchmarks.
  • They support inputs up to 32k tokens and flexible embedding dimensions while operating in more than 30 languages.
  • They demonstrate effectiveness on image-text retrieval, visual question answering, and video-text matching tasks.
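
The flexible embedding dimensions mentioned above are usually consumed by keeping a prefix of the full vector and renormalizing it. The sketch below assumes that convention; the vectors are made-up toy values, and this page does not list which dimensions the released models actually support.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` coordinates and L2-normalize the result."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]

def cosine(u, v):
    # Inputs are expected to be unit-length already.
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 8-d embeddings standing in for real model outputs.
full_query = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
full_doc = [0.5, 0.6, 0.3, 0.4, 0.1, 0.2, 0.02, 0.05]

for dim in (8, 4, 2):
    q = truncate_embedding(full_query, dim)
    d = truncate_embedding(full_doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Matryoshka training is what makes such prefixes meaningful: the model is optimized so that the leading coordinates already carry most of the signal, so an index can store shorter vectors at some cost in accuracy.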

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The two-stage embedding-plus-reranker pipeline may become a standard pattern for high-precision multimodal search systems.
  • Flexible embedding dimensions could allow practitioners to reduce storage and compute costs while retaining most accuracy.
  • The same base architecture might be adapted to new modalities or longer contexts by repeating the described training stages.
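
The embedding-plus-reranker pattern in the first bullet can be sketched as a two-stage function: a cheap vector search narrows the corpus, then a pairwise scorer reorders the survivors. Everything below is hypothetical: the embeddings are hand-written, and `toy_pair_score` is a word-overlap stand-in for a real cross-encoder, not the Qwen3-VL-Reranker API.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, corpus_vecs, k):
    """Stage 1: rank every document by embedding similarity, keep the top k ids."""
    ranked = sorted(
        corpus_vecs,
        key=lambda doc_id: dot(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def rerank(query, candidate_ids, corpus, score_pair):
    """Stage 2: score each (query, document) pair jointly and re-sort."""
    return sorted(
        candidate_ids,
        key=lambda doc_id: score_pair(query, corpus[doc_id]),
        reverse=True,
    )

def toy_pair_score(query, document):
    # Hypothetical stand-in for a cross-encoder: count shared words.
    return len(set(query.lower().split()) & set(document.lower().split()))

corpus = {
    "d1": "a red watch on a wooden table",
    "d2": "a silver watch with a leather strap",
    "d3": "a dog running on the beach",
}
# Hypothetical precomputed document embeddings and query embedding.
corpus_vecs = {"d1": [0.9, 0.1], "d2": [0.8, 0.3], "d3": [0.1, 0.9]}
query = "silver watch"
query_vec = [1.0, 0.2]

candidates = retrieve(query_vec, corpus_vecs, k=2)
final = rerank(query, candidates, corpus, toy_pair_score)
print(candidates, final)
```

The design point is cost: the first stage touches every document but only with a dot product, while the expensive pairwise scorer runs on just `k` candidates.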

Load-bearing premise

The MMEB-V2 benchmark and other multimodal evaluation sets accurately reflect real-world retrieval performance without significant biases in task design or data distribution.

What would settle it

Independent evaluation on a fresh collection of real-world multimodal queries and documents, not drawn from the training or benchmark distributions, that places the Qwen3-VL models below leading alternatives on the same metrics.

Original abstract

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker models as extensions of the Qwen3-VL foundation model for multimodal retrieval and ranking. These models map text, images, documents, and video into a unified embedding space using a multi-stage training process involving contrastive pre-training and distillation from a reranker. The embedding models support Matryoshka Representation Learning for flexible dimensions and up to 32k context length, while the reranker uses cross-attention. The series supports over 30 languages and is available in 2B and 8B sizes. The central claim is that the 8B embedding model achieves a state-of-the-art overall score of 77.8 on the MMEB-V2 benchmark, ranking first as of January 8, 2025, with strong results on tasks including image-text retrieval, VQA, and video-text matching.

Significance. If the results are substantiated with full experimental details and hold under independent verification, this contribution would be significant in the field of multimodal information retrieval. It offers a practical unified framework that handles multiple modalities and languages, potentially improving the performance of search and recommendation systems dealing with mixed media content. The inclusion of Matryoshka learning and long-context support adds flexibility for deployment. The high benchmark score indicates competitive performance, and releasing models of varying sizes broadens accessibility.

major comments (3)
  1. [Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.
  2. [Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.
  3. [Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.
minor comments (1)
  1. [Abstract] The abstract refers to 'diverse multimodal embedding evaluation benchmarks' but only explicitly names MMEB-V2; a complete list or reference to the full set of benchmarks used would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating planned revisions to strengthen the manuscript while maintaining accuracy in our claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.

    Authors: We agree the abstract would be strengthened by additional context. The full manuscript (Section 4 and Table 1) provides baseline comparisons with scores for models such as CLIP, BLIP, and other recent multimodal embedders, along with our 77.8 overall score. Training data is described at a high level as large-scale web-crawled multimodal corpora with specific sources summarized in Section 3.1; full composition details are extensive and we report aggregate statistics rather than exhaustive lists. Statistical significance is evaluated through repeated runs on key subtasks where variance is reported. We will revise the abstract to briefly note the top competing baselines and their scores, and add a sentence referencing the public benchmark to address post-hoc selection. We cannot expand training data composition beyond what is already summarized without compromising proprietary details.

    revision: partial

  2. Referee: [Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.

    Authors: We thank the referee for highlighting this. The reranker distillation provides training signals (soft labels) for the embedding model, but MMEB-V2 evaluation uses the standard public benchmark protocol with no involvement of our reranker. All reported gains are measured against external baselines under identical conditions. We will add a clarifying statement in the training pipeline description (Section 3.2) explicitly noting that benchmark evaluation remains fully independent and that no circularity influences the SOTA ranking.

    revision: yes

  3. Referee: [Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.

    Authors: This is a fair and important observation. The manuscript currently emphasizes empirical results without a dedicated limitations analysis of the benchmark. We will add a new paragraph in the Experiments section (or a Limitations subsection) discussing MMEB-V2's task distribution, noting its emphasis on image-text, VQA, and video tasks drawn from web sources, and acknowledging that models pretrained on similar data (including ours) may benefit from distributional overlap. We will also include modality-wise performance breakdowns as an ablation to illustrate where gains are concentrated.

    revision: yes
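
The soft-label distillation described in the second response can be illustrated with a small sketch. It assumes the teacher (reranker) and the student (embedding model) each produce one score per candidate document for a query, and that the student is trained to match the teacher's softmax distribution under a KL divergence. The exact loss used in the paper is not given on this page, so the KL form, the temperature, and all names are assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical scores for three candidate documents of one query.
teacher = [4.0, 1.0, -2.0]               # reranker relevance scores
agreeing_student = [3.5, 1.2, -1.5]      # similar ordering and spacing
disagreeing_student = [-2.0, 1.0, 4.0]   # reversed ordering

loss_agree = distillation_loss(teacher, agreeing_student)
loss_disagree = distillation_loss(teacher, disagreeing_student)
print(loss_agree < loss_disagree)
```

Because the teacher only shapes the training signal, a setup like this is compatible with the authors' point that benchmark evaluation does not involve the reranker.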

Circularity Check

0 steps flagged

No circularity: SOTA claim rests on external benchmark measurement

Full rationale

The paper reports an empirical performance result (77.8 on MMEB-V2) obtained by running the trained model on a public external benchmark. The multi-stage training (contrastive pre-training followed by reranker distillation) and architectural choices (Matryoshka dimensions, 32k context) are described as engineering steps whose output is then evaluated on independent test sets. No equation, prediction, or uniqueness claim reduces the reported score to a self-defined metric, a fitted parameter renamed as a prediction, or a self-citation chain. The result is therefore falsifiable against the same external data by any other model and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; all claims rest on standard contrastive learning assumptions and benchmark evaluations that are not detailed here.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

    cs.CV 2026-04 unverdicted novelty 8.0

    FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.

  3. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  4. FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries

    cs.MM 2026-05 unverdicted novelty 7.0

    FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.

  5. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  6. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 7.0

    Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.

  7. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  8. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  9. CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

    cs.CV 2026-04 unverdicted novelty 7.0

    CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.

  10. jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

    cs.CL 2026-05 unverdicted novelty 6.0

    GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...

  11. MINER: Mining Multimodal Internal Representation for Efficient Retrieval

    cs.LG 2026-05 unverdicted novelty 6.0

    MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.

  12. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

  13. Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction

    cs.LG 2026-05 unverdicted novelty 6.0

    A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.

  14. ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

    cs.AI 2026-05 unverdicted novelty 6.0

    ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.

  15. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  16. Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while...

  17. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 conditional novelty 6.0

    SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

  18. Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...

  19. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  20. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  21. Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.

  22. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  23. MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

    cs.CV 2026-04 unverdicted novelty 6.0

    MMEmb-R1 adaptively applies chain-of-thought reasoning to multimodal embeddings via pair-aware counterfactual selection and RL, reaching 71.2 on MMEB-V2 with a 4B model and lower latency.

  24. A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval

    cs.CV 2026-05 conditional novelty 5.0

    Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...

  25. MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

    cs.CV 2026-05 unverdicted novelty 5.0

    MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.

  26. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...

  27. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 augments open-world object detection with retrieval from a visual memory bank to generate instance-level spatial and class-aware contextual priors that improve performance on rare categories in zero-shot LVIS tests.

  28. VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    VL-SAM-v3 improves open-world object detection on LVIS by retrieving visual prototypes from a memory bank to generate sparse spatial and dense contextual priors that are fused into detection prompts.

  29. Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.

  30. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  31. Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers

    cs.CV 2026-05 unverdicted novelty 4.0

    Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
