arxiv: 2601.04720 · v2 · submitted 2026-01-08 · 💻 cs.CL

Recognition: 2 theorem links


Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin


Pith reviewed 2026-05-12 09:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords multimodal retrieval, embedding models, reranking, contrastive pre-training, vision-language models, cross-encoder, unified representation space, Matryoshka learning


The pith

Qwen3-VL-Embedding-8B scores 77.8 overall on MMEB-V2, the top result among models on that benchmark at the time of the report.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker as extensions of the Qwen3-VL model to create a unified pipeline for retrieving and ranking across text, images, document images, and video. The embedding model uses multi-stage training that starts with large-scale contrastive pre-training and then distills from a reranker, while also supporting flexible embedding sizes through Matryoshka Representation Learning and inputs up to 32k tokens. The reranker applies cross-attention in a cross-encoder setup to score relevance between queries and documents at fine grain. Both models retain support for more than 30 languages and are released in 2B and 8B sizes. A sympathetic reader would care because accurate multimodal search lets people find relevant content when queries and results mix different media types.
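The retrieve-then-rerank pattern summarized above can be sketched in a few lines of plain Python. This is only an illustration of the control flow: `embed` and `rerank_score` are hypothetical toy stand-ins (bag-of-words scoring), not the Qwen3-VL models or their API, and the corpus strings are made up.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: scores the pair jointly.
    Here it is the fraction of query tokens that appear in the document."""
    q_tokens = query.lower().split()
    d_tokens = set(doc.lower().split())
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)

def search(query, corpus, k_retrieve=3, k_final=2):
    # Stage 1: cheap recall using independently computed embeddings.
    q_vec = embed(query)
    candidates = sorted(corpus, key=lambda d: cosine(q_vec, embed(d)),
                        reverse=True)[:k_retrieve]
    # Stage 2: more precise scoring of each (query, candidate) pair.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

corpus = [
    "a photo of a red bicycle",
    "invoice scan with a table of prices",
    "video of a dog catching a frisbee",
    "red car parked on a street",
]
results = search("red bicycle photo", corpus)
print(results)
```

The design point is that the first stage scores documents independently of the query (so document vectors can be precomputed), while the second stage looks at each query-document pair together and is therefore run only on a short candidate list.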

Core claim

The Qwen3-VL-Embedding model, built through contrastive pre-training followed by reranking distillation on the Qwen3-VL foundation, maps text, images, documents, and video into a single high-dimensional representation space that supports variable dimensions, while the Qwen3-VL-Reranker uses cross-attention to perform precise relevance estimation on query-document pairs; together they deliver state-of-the-art results on multimodal benchmarks, including an overall score of 77.8 on MMEB-V2 for the 8B embedding model.

What carries the argument

The multi-stage training paradigm that moves from large-scale contrastive pre-training to reranking model distillation, combined with Matryoshka Representation Learning, to produce unified embeddings across modalities.
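This page does not state the exact objective used in the contrastive pre-training stage. A common choice for this kind of training is an InfoNCE loss with in-batch negatives, sketched below under that assumption; the temperature value and the 2-D toy vectors are illustrative only.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] is paired with docs[i]; every other docs[j] acts as a negative.
    Vectors are assumed to be L2-normalized already.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive under a softmax.
        total += log_z - logits[i]
    return total / len(queries)

# Positives aligned with their queries give a low loss ...
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# ... while swapping the documents makes each positive the worst candidate.
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```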

If this is right

The 8B embedding model attains an overall score of 77.8 on MMEB-V2 and ranks first among all models evaluated as of January 8, 2025.
The models achieve state-of-the-art results across diverse multimodal embedding evaluation benchmarks.
They support inputs up to 32k tokens and flexible embedding dimensions while operating in more than 30 languages.
They demonstrate effectiveness on image-text retrieval, visual question answering, and video-text matching tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-stage embedding-plus-reranker pipeline may become a standard pattern for high-precision multimodal search systems.
Flexible embedding dimensions could allow practitioners to reduce storage and compute costs while retaining most accuracy.
The same base architecture might be adapted to new modalities or longer contexts by repeating the described training stages.
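The storage and compute trade-off behind flexible embedding dimensions can be illustrated with a short sketch. With Matryoshka-style embeddings, a shorter vector is typically obtained by keeping the leading coordinates and re-normalizing; the helper name `truncate_embedding` and the 8-dimensional toy vectors below are hypothetical, and the approach only preserves quality when the model was trained for it.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def cosine(u, v):
    # Inputs are unit vectors here, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(u, v))

# Toy vectors; real embeddings would come from the model.
full_a = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
full_b = [0.5, 0.6, 0.3, 0.4, 0.1, 0.2, 0.02, 0.05]

for dim in (8, 4, 2):
    a = truncate_embedding(full_a, dim)
    b = truncate_embedding(full_b, dim)
    print(dim, round(cosine(a, b), 3))
```

Storing 2 of 8 coordinates cuts index size by a factor of four in this toy case; how much retrieval accuracy survives at each size is an empirical question that this page does not quantify.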

Load-bearing premise

The MMEB-V2 benchmark and other multimodal evaluation sets accurately reflect real-world retrieval performance without significant biases in task design or data distribution.

What would settle it

Independent evaluation on a fresh collection of real-world multimodal queries and documents, not drawn from the training or benchmark distributions, that places the Qwen3-VL models below leading alternatives on the same metrics.

Original abstract

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Qwen3-VL-Embedding and Reranker give a practical multimodal pipeline with a reported 77.8 on MMEB-V2, but the abstract leaves too many gaps in data and experiments to judge the gains.

The punchline is that Qwen3-VL-Embedding-8B reaches 77.8 on MMEB-V2 and tops the leaderboard, using a multi-stage pipeline on the Qwen3-VL base that includes contrastive pre-training, reranker distillation, Matryoshka dimensions, and 32k context. This gives a unified way to embed and rerank across text, images, documents, and video in over 30 languages.

The paper does a solid job releasing usable models in 2B and 8B sizes that build directly on an existing strong foundation model. The end-to-end pipeline idea is straightforward and addresses a real need for multimodal retrieval without juggling separate systems.

Soft spots center on missing information. There are no specifics on the composition of the large-scale training data, no ablations for the training stages, and no discussion of statistical significance or how baselines were selected. The internal distillation step could introduce circularity if not isolated properly. The concern about MMEB-V2 benchmark biases in task design and data distribution is worth taking seriously, since no external validation or subset analysis is mentioned to show the results generalize.

This is for practitioners building multimodal search or retrieval-augmented systems who want off-the-shelf options, and for researchers tracking embedding model progress. A reader focused on production deployment would find the models and scores useful to evaluate, while someone seeking methodological innovations might look elsewhere.

It deserves serious referee time because the models are new and the performance numbers, if the experiments hold up, would be a useful data point in the field. I recommend sending it for peer review, with the expectation that reviewers will ask for more experimental details and robustness checks.

Referee Report

3 major / 1 minor

Summary. The paper introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker models as extensions of the Qwen3-VL foundation model for multimodal retrieval and ranking. These models map text, images, documents, and video into a unified embedding space using a multi-stage training process involving contrastive pre-training and distillation from a reranker. The embedding models support Matryoshka Representation Learning for flexible dimensions and up to 32k context length, while the reranker uses cross-attention. The series supports over 30 languages and is available in 2B and 8B sizes. The central claim is that the 8B embedding model achieves a state-of-the-art overall score of 77.8 on the MMEB-V2 benchmark, ranking first as of January 8, 2025, with strong results on tasks including image-text retrieval, VQA, and video-text matching.

Significance. If the results are substantiated with full experimental details and hold under independent verification, this contribution would be significant in the field of multimodal information retrieval. It offers a practical unified framework that handles multiple modalities and languages, potentially improving the performance of search and recommendation systems dealing with mixed media content. The inclusion of Matryoshka learning and long-context support adds flexibility for deployment. The high benchmark score indicates competitive performance, and releasing models of varying sizes broadens accessibility.

major comments (3)

[Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.
[Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.
[Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.

minor comments (1)

[Abstract] The abstract refers to 'diverse multimodal embedding evaluation benchmarks' but only explicitly names MMEB-V2; a complete list or reference to the full set of benchmarks used would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating planned revisions to strengthen the manuscript while maintaining accuracy in our claims.


Referee: [Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.

Authors: We agree the abstract would be strengthened by additional context. The full manuscript (Section 4 and Table 1) provides baseline comparisons with scores for models such as CLIP, BLIP, and other recent multimodal embedders, along with our 77.8 overall score. Training data is described at a high level as large-scale web-crawled multimodal corpora with specific sources summarized in Section 3.1; full composition details are extensive and we report aggregate statistics rather than exhaustive lists. Statistical significance is evaluated through repeated runs on key subtasks where variance is reported. We will revise the abstract to briefly note the top competing baselines and their scores, and add a sentence referencing the public benchmark to address post-hoc selection. We cannot expand training data composition beyond what is already summarized without compromising proprietary details.
revision: partial

Referee: [Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.

Authors: We thank the referee for highlighting this. The reranker distillation provides training signals (soft labels) for the embedding model, but MMEB-V2 evaluation uses the standard public benchmark protocol with no involvement of our reranker. All reported gains are measured against external baselines under identical conditions. We will add a clarifying statement in the training pipeline description (Section 3.2) explicitly noting that benchmark evaluation remains fully independent and that no circularity influences the SOTA ranking.
revision: yes

Referee: [Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.

Authors: This is a fair and important observation. The manuscript currently emphasizes empirical results without a dedicated limitations analysis of the benchmark. We will add a new paragraph in the Experiments section (or a Limitations subsection) discussing MMEB-V2's task distribution, noting its emphasis on image-text, VQA, and video tasks drawn from web sources, and acknowledging that models pretrained on similar data (including ours) may benefit from distributional overlap. We will also include modality-wise performance breakdowns as an ablation to illustrate where gains are concentrated.
revision: yes
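
The second response above describes the reranker as a source of soft labels for the embedding model. One common way to use such labels, shown here purely as an assumption since the page does not state the actual objective, is to minimize the KL divergence between the reranker's score distribution over a candidate list and the embedding model's similarity distribution over the same list. All scores below are made-up numbers.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over one query's candidate list."""
    p = softmax(teacher_scores, temperature)  # soft labels from the reranker
    q = softmax(student_scores, temperature)  # embedding-model similarities
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]  # hypothetical reranker relevance scores
agrees = distillation_loss(teacher, [4.0, 1.0, 0.5])
disagrees = distillation_loss(teacher, [0.5, 1.0, 4.0])
print(agrees, disagrees)
```

The loss is zero when the student ranks and weights candidates exactly as the teacher does and grows as the two distributions diverge. Note that this training signal is separate from benchmark evaluation, which is the distinction the authors draw in their reply.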

Circularity Check

0 steps flagged

No circularity: SOTA claim rests on external benchmark measurement


The paper reports an empirical performance result (77.8 on MMEB-V2) obtained by running the trained model on a public external benchmark. The multi-stage training (contrastive pre-training followed by reranker distillation) and architectural choices (Matryoshka dimensions, 32k context) are described as engineering steps whose output is then evaluated on independent test sets. No equation, prediction, or uniqueness claim reduces the reported score to a self-defined metric, a fitted parameter renamed as a prediction, or a self-citation chain. The result is therefore falsifiable against the same external data by any other model and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; all claims rest on standard contrastive learning assumptions and benchmark evaluations that are not detailed here.

pith-pipeline@v0.9.0 · 5662 in / 1134 out tokens · 56182 ms · 2026-05-12T09:29:15.790012+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation... Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2"

IndisputableMonolith.Foundation.HierarchyForcing · uniform_scaling_forced · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

Paper passage: "supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
cs.CV 2026-04 unverdicted novelty 8.0

FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
cs.CV 2026-05 unverdicted novelty 7.0

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
cs.MM 2026-05 unverdicted novelty 7.0

FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR 2026-04 unverdicted novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
cs.CV 2026-04 unverdicted novelty 7.0

CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 6.0

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
MINER: Mining Multimodal Internal Representation for Efficient Retrieval
cs.LG 2026-05 unverdicted novelty 6.0

MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.
Towards Generation-Efficient Uncertainty Estimation in Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
cs.LG 2026-05 unverdicted novelty 6.0

A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
cs.AI 2026-05 unverdicted novelty 6.0

ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
cs.CV 2026-04 unverdicted novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while...
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
cs.CV 2026-04 conditional novelty 6.0

SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
cs.AI 2026-04 unverdicted novelty 6.0

MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Small Vision-Language Models are Smart Compressors for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
cs.CR 2026-04 unverdicted novelty 6.0

Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
cs.IR 2026-04 unverdicted novelty 6.0

HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
cs.CV 2026-04 unverdicted novelty 6.0

MMEmb-R1 adaptively applies chain-of-thought reasoning to multimodal embeddings via pair-aware counterfactual selection and RL, reaching 71.2 on MMEB-V2 with a 4B model and lower latency.
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
cs.CV 2026-05 conditional novelty 5.0

Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
cs.CV 2026-05 unverdicted novelty 5.0

MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 augments open-world object detection with retrieval from a visual memory bank to generate instance-level spatial and class-aware contextual priors that improve performance on rare categories in zero-shot LVIS tests.
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
cs.CV 2026-05 unverdicted novelty 5.0

VL-SAM-v3 improves open-world object detection on LVIS by retrieving visual prototypes from a memory bank to generate sparse spatial and dense contextual priors that are fused into detection prompts.
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
cs.CV 2026-05 unverdicted novelty 4.0

Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 27 Pith papers · 12 internal anchors

[1]

URLhttps://arxiv.org/abs/2511.07025. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong ...

work page arXiv
[2]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset.arXiv preprint arXiv:1611.09268,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Findings of the Associa- tion for Computational Linguistics: ACL 2024, pp. 2318–2335, Bangkok, Thail...

work page 2024
[5]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation , booktitle =

As- sociation for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.137. URL https: //aclanthology.org/2024.findings-acl.137/. Ziqi Dai, Xin Zhang, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Supervised fine-tuning or contrastive learning? towards better multimodal llm reranking.arXiv preprint...

work page doi:10.18653/v1/2024.findings-acl.137 2024
[6]

Mmteb: Massive multilingual text embedding bench- mark.arXiv preprint arXiv:2502.13595,

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzeminski, Genta Indra Winata, et al. Mmteb: Massive multilingual text embedding benchmark. InInternational Conference on Learning Representations. International Conference on Learning Representations, 2025a. Kenneth Enevoldsen, Isa...

work page arXiv
[7]

Moon embedding: Multimodal representation learning for e- commerce search advertising.arXiv preprint arXiv:2511.11305,

Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, et al. Moon embedding: Multimodal representation learning for e- commerce search advertising.arXiv preprint arXiv:2511.11305,

work page arXiv
[8]

arXiv:2506.18902 [cs.AI] https://arxiv.org/abs/2506.18902

URLhttps://arxiv.org/abs/2506.18902. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InThe T enth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,

work page arXiv 2022
[9]

doi: 10.1109/TASLP .2024.3402087

ISSN 2329-9290. doi: 10.1109/TASLP .2024.3402087. URL https://doi.org/10.1109/TASLP.2024.3402087. Xin Huang and Kye Min Tan. Beyond text: Unlocking true multimodal, end- to-end rag with tomoro colqwen3,

work page doi:10.1109/taslp 2024
[10]

GPT-4o System Card

URL https://tomoro.ai/insights/ beyond-text-unlocking-true-multimodal-end-to-end-rag-with-tomoro-colqwen3. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Rzenembed: Towards comprehensive multimodal retrieval.arXiv preprint arXiv:2510.27350,

Weijian Jian, Yajun Zhang, Dawei Liang, Chunyu Xie, Yixiao He, Dawei Leng, and Yuhui Yin. Rzenembed: Towards comprehensive multimodal retrieval.arXiv preprint arXiv:2510.27350,

work page arXiv
[12]

E5-V: universal embeddings with multi- modal large language models

Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580,

work page arXiv
[13]

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models.arXiv preprint arXiv:2405.17428,

[14]

Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, et al. Gemini embedding: Generalizable embeddings from gemini. arXiv preprint arXiv:2503.07891.

[15]

Mingxin Li, Zhijie Nie, Yanzhao Zhang, Dingkun Long, Richong Zhang, and Pengjun Xie. Improving general text embedding model: Tackling task conflict and data imbalance through model merging. arXiv preprint arXiv:2410.15035.

[16]

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.

[17]

Lang Mei, Siyu Mo, Zhihan Yang, and Chong Chen. A survey of multimodal retrieval-augmented generation. arXiv preprint arXiv:2504.08748.

[18]

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590.

[19]

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[20]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

[21]

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[22]

URL https://nomic.ai/blog/posts/nomic-embed-multimodal.

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.

[23]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024a.

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of ...

[24]

URL https://arxiv.org/abs/2507.05513.

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025a.

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level con...

[25]

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 9274–9285, 2025b.

Yanzhao Zhang, Mingxin Li, Dingkun Lon...

[26]

URL https://arxiv.org/abs/2506.20923.

Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. Megapairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 19076–19095.


A Dataset Examples

Table 7: Dataset format examples: Docmatix † and MS-COCO (Lin et al., 2014).

Dataset: Docmatix
Instruction: Find a screenshot that relevant to the user’s question.
Queries (Qi): q_01 What type of research project was announced by the Danish Cancer Society on 01/02/21?
Corpus (Ci): d_01 d_02 d_03
Relevance (Ri): {q_01: pos: [d_01], neg: [d_0...


Instruction: Retrieve passages that answer this question.

Example 1
Query: Text: Which NFL team represented the AFC at Super Bowl 50?
Document: Text: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated...
Sim.: 0.81

Example 2
T...
