Recognition: 2 theorem links · Lean Theorem
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Pith reviewed 2026-05-12 09:29 UTC · model grok-4.3
The pith
Qwen3-VL-Embedding-8B reaches 77.8 on the MMEB-V2 benchmark and leads all multimodal retrieval models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Qwen3-VL-Embedding model, built on the Qwen3-VL foundation through contrastive pre-training followed by reranker distillation, maps text, images, document images, and video into a single high-dimensional representation space with flexible (Matryoshka) output dimensions, while the Qwen3-VL-Reranker uses cross-attention over query-document pairs for precise relevance estimation; together they deliver state-of-the-art results on multimodal benchmarks, including an overall score of 77.8 on MMEB-V2 for the 8B embedding model.
What carries the argument
A multi-stage training paradigm that moves from large-scale contrastive pre-training to reranking-model distillation, combined with Matryoshka Representation Learning, producing unified embeddings across modalities.
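To make the first training stage concrete, here is a minimal sketch of a symmetric in-batch contrastive (InfoNCE) objective with Matryoshka-style nested dimensions. The temperature, dimension list, and random tensors are illustrative assumptions, not the authors' released code or hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    """Symmetric InfoNCE with in-batch negatives; q and d are (B, dim) pooled embeddings."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def matryoshka_contrastive_loss(q, d, dims=(256, 512, 1024, 2048)):
    """Average the contrastive loss over nested prefixes of the embedding so that
    truncated vectors remain usable (Matryoshka Representation Learning)."""
    return torch.stack([info_nce(q[:, :k], d[:, :k]) for k in dims]).mean()

# q_emb / d_emb stand in for pooled outputs of a multimodal encoder (hypothetical shapes).
q_emb, d_emb = torch.randn(32, 2048), torch.randn(32, 2048)
loss = matryoshka_contrastive_loss(q_emb, d_emb)
```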
If this is right
- The 8B embedding model attains an overall score of 77.8 on MMEB-V2 and ranks first among all models evaluated as of January 8, 2025.
- The models achieve state-of-the-art results across diverse multimodal embedding evaluation benchmarks.
- They support inputs up to 32k tokens and flexible embedding dimensions while operating in more than 30 languages.
- They demonstrate effectiveness on image-text retrieval, visual question answering, and video-text matching tasks.
Where Pith is reading between the lines
- The two-stage embedding-plus-reranker pipeline may become a standard pattern for high-precision multimodal search systems.
- Flexible embedding dimensions could allow practitioners to reduce storage and compute costs while retaining most accuracy (see the truncation sketch after this list).
- The same base architecture might be adapted to new modalities or longer contexts by repeating the described training stages.
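As a rough illustration of the storage point above: with Matryoshka-trained embeddings, a deployment can keep only a prefix of each vector and re-normalize before cosine search. The 2048-dimension width and the document counts below are assumptions for illustration, not figures from the report.

```python
import numpy as np

def truncate_and_renormalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of Matryoshka-trained embeddings and
    re-normalize so cosine similarity remains meaningful."""
    cut = vectors[:, :dim].astype(np.float32)
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)

# Hypothetical index: 10k documents at a full width of 2048 float32 dims.
# At 1M documents this would be roughly 8 GB; truncating to 256 dims cuts it to ~1 GB.
full_index = np.random.randn(10_000, 2048).astype(np.float32)
small_index = truncate_and_renormalize(full_index, 256)
query = truncate_and_renormalize(np.random.randn(1, 2048), 256)
scores = small_index @ query[0]          # cosine scores against the compact index
top10 = np.argsort(-scores)[:10]
```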
Load-bearing premise
The MMEB-V2 benchmark and other multimodal evaluation sets accurately reflect real-world retrieval performance without significant biases in task design or data distribution.
What would settle it
Independent evaluation on a fresh collection of real-world multimodal queries and documents, not drawn from the training or benchmark distributions, that places the Qwen3-VL models below leading alternatives on the same metrics.
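Such an independent check would come down to standard retrieval metrics computed over a fresh query set. A minimal sketch, assuming binary relevance judgments and plain recall@k / nDCG@k definitions; this is not the MMEB-V2 scoring protocol, just the shape of the measurement.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: discounted gain of hits, normalized by the ideal ranking."""
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0
```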
read the original abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
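The abstract's embed-then-rerank design reduces to a familiar two-stage pattern: a bi-encoder narrows the corpus cheaply, then a cross-encoder rescores the shortlist. Below is a minimal, framework-agnostic sketch; embed_fn and rerank_fn are hypothetical callables standing in for the embedding and reranker models, since the abstract does not fix an API.

```python
import numpy as np

def search(query, corpus, embed_fn, rerank_fn, k_retrieve=100, k_final=10):
    """Stage 1: bi-encoder embeddings shortlist candidates by cosine similarity.
    Stage 2: a cross-encoder reranker rescores only the shortlisted query-document pairs."""
    corpus_vecs = np.stack([embed_fn(doc) for doc in corpus])
    corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    q = embed_fn(query)
    q = q / np.linalg.norm(q)

    coarse = corpus_vecs @ q                         # fast, approximate relevance
    shortlist = np.argsort(-coarse)[:k_retrieve]

    fine = [(i, rerank_fn(query, corpus[i])) for i in shortlist]
    fine.sort(key=lambda pair: pair[1], reverse=True)
    return [corpus[i] for i, _ in fine[:k_final]]
```

In a real deployment, embed_fn would wrap the Qwen3-VL-Embedding forward pass (with corpus vectors precomputed and indexed offline) and rerank_fn the Qwen3-VL-Reranker relevance score.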
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qwen3-VL-Embedding and Qwen3-VL-Reranker models as extensions of the Qwen3-VL foundation model for multimodal retrieval and ranking. These models map text, images, documents, and video into a unified embedding space using a multi-stage training process involving contrastive pre-training and distillation from a reranker. The embedding models support Matryoshka Representation Learning for flexible dimensions and up to 32k context length, while the reranker uses cross-attention. The series supports over 30 languages and is available in 2B and 8B sizes. The central claim is that the 8B embedding model achieves a state-of-the-art overall score of 77.8 on the MMEB-V2 benchmark, ranking first as of January 8, 2025, with strong results on tasks including image-text retrieval, VQA, and video-text matching.
Significance. If the results are substantiated with full experimental details and hold under independent verification, this contribution would be significant in the field of multimodal information retrieval. It offers a practical unified framework that handles multiple modalities and languages, potentially improving the performance of search and recommendation systems dealing with mixed media content. The inclusion of Matryoshka learning and long-context support adds flexibility for deployment. The high benchmark score indicates competitive performance, and releasing models of varying sizes broadens accessibility.
major comments (3)
- [Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.
- [Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.
- [Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.
minor comments (1)
- [Abstract] The abstract refers to 'diverse multimodal embedding evaluation benchmarks' but only explicitly names MMEB-V2; a complete list or reference to the full set of benchmarks used would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating planned revisions to strengthen the manuscript while maintaining accuracy in our claims.
read point-by-point responses
- Referee: [Abstract] The claim that Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2 and ranks first is presented without accompanying details on training data composition, baseline models and their scores, statistical significance testing, or controls for post-hoc selection. This information is load-bearing for validating the state-of-the-art assertion.
  Authors: We agree the abstract would be strengthened by additional context. The full manuscript (Section 4 and Table 1) provides baseline comparisons with scores for models such as CLIP, BLIP, and other recent multimodal embedders, along with our 77.8 overall score. Training data is described at a high level as large-scale web-crawled multimodal corpora with specific sources summarized in Section 3.1; full composition details are extensive and we report aggregate statistics rather than exhaustive lists. Statistical significance is evaluated through repeated runs on key subtasks where variance is reported. We will revise the abstract to briefly note the top competing baselines and their scores, and add a sentence referencing the public benchmark to address post-hoc selection. We cannot expand training data composition beyond what is already summarized without compromising proprietary details. revision: partial
- Referee: [Abstract] The multi-stage training pipeline includes distillation from a reranker in the same model family, which introduces potential circularity in the evaluation; the manuscript should clarify how this affects the independence of the reported performance gains.
  Authors: We thank the referee for highlighting this. The reranker distillation provides training signals (soft labels) for the embedding model, but MMEB-V2 evaluation uses the standard public benchmark protocol with no involvement of our reranker. All reported gains are measured against external baselines under identical conditions. We will add a clarifying statement in the training pipeline description (Section 3.2) explicitly noting that benchmark evaluation remains fully independent and that no circularity influences the SOTA ranking. revision: yes
- Referee: [Abstract] There is no discussion or ablation regarding potential biases in the MMEB-V2 benchmark's task design and data distribution, such as over-representation of certain modalities or domains that might advantage models trained on similar web-scale data. This is critical for assessing whether the SOTA ranking generalizes beyond the specific benchmark.
  Authors: This is a fair and important observation. The manuscript currently emphasizes empirical results without a dedicated limitations analysis of the benchmark. We will add a new paragraph in the Experiments section (or a Limitations subsection) discussing MMEB-V2's task distribution, noting its emphasis on image-text, VQA, and video tasks drawn from web sources, and acknowledging that models pretrained on similar data (including ours) may benefit from distributional overlap. We will also include modality-wise performance breakdowns as an ablation to illustrate where gains are concentrated. revision: yes
Circularity Check
No circularity: SOTA claim rests on external benchmark measurement
full rationale
The paper reports an empirical performance result (77.8 on MMEB-V2) obtained by running the trained model on a public external benchmark. The multi-stage training (contrastive pre-training followed by reranker distillation) and architectural choices (Matryoshka dimensions, 32k context) are described as engineering steps whose output is then evaluated on independent test sets. No equation, prediction, or uniqueness claim reduces the reported score to a self-defined metric, a fitted parameter renamed as a prediction, or a self-citation chain. The result is therefore falsifiable against the same external data by any other model and does not exhibit any of the enumerated circularity patterns.
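For concreteness, the reranker-to-embedding distillation that stays entirely inside training (and never touches the benchmark) is often implemented as a soft-label KL term over a candidate list. A minimal sketch under that assumption; the temperatures and candidate construction are illustrative, not details from the report.

```python
import torch
import torch.nn.functional as F

def distill_from_reranker(q_emb, cand_embs, teacher_scores, t_student=0.05, t_teacher=1.0):
    """Pull the bi-encoder's similarity distribution toward the reranker's soft labels.
    q_emb: (dim,) query embedding; cand_embs: (N, dim) candidate embeddings;
    teacher_scores: (N,) relevance logits produced by the cross-encoder reranker."""
    q = F.normalize(q_emb, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    student_log_probs = F.log_softmax((c @ q) / t_student, dim=-1)
    teacher_probs = F.softmax(teacher_scores / t_teacher, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="sum")
```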
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation... Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2"
- IndisputableMonolith.Foundation.HierarchyForcing.uniform_scaling_forced (unclear)
  Unclear: the relation between this passage and the cited Recognition theorem is ambiguous.
  Passage: "supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
-
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
-
MINER: Mining Multimodal Internal Representation for Efficient Retrieval
MINER fuses internal transformer layer representations via probing and adaptive sparse fusion to improve dense single-vector retrieval quality on visual documents by up to 4.5% nDCG@5 while preserving efficiency.
-
Towards Generation-Efficient Uncertainty Estimation in Large Language Models
Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.
-
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction
A single-pass black-box method models LLM outputs as dynamical systems via Koopman operators to detect hallucinations with claimed state-of-the-art accuracy and lower cost.
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
-
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while...
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
-
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
-
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
-
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
-
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
MMEmb-R1 adaptively applies chain-of-thought reasoning to multimodal embeddings via pair-aware counterfactual selection and RL, reaching 71.2 on MMEB-V2 with a 4B model and lower latency.
-
LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution
LucidNFT combines a new LR-referenced consistency reward, decoupled normalization, and a real-degradation dataset to improve perceptual quality in flow-matching super-resolution while preserving input fidelity.
-
A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
Single-vector aggregation in visual financial document retrieval collapses semantically distinct documents due to global texture dominance, as demonstrated by a new diagnostic benchmark where patch-level signals detec...
-
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
-
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
VL-SAM-v3 retrieves visual prototypes from memory to generate sparse spatial and dense contextual priors that refine detection prompts, yielding gains on rare categories in LVIS for both open-vocabulary and open-ended...
-
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
VL-SAM-v3 augments open-world object detection with retrieval from a visual memory bank to generate instance-level spatial and class-aware contextual priors that improve performance on rare categories in zero-shot LVIS tests.
-
VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection
VL-SAM-v3 improves open-world object detection on LVIS by retrieving visual prototypes from a memory bank to generate sparse spatial and dense contextual priors that are fused into detection prompts.
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
-
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.