FreeRet: MLLMs as Training-Free Retrievers
Pith reviewed 2026-05-18 12:56 UTC · model grok-4.3
The pith
Off-the-shelf MLLMs can serve as powerful multimodal retrievers without any training by deriving faithful embeddings for search and using reasoning for reranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FreeRet shows that any off-the-shelf MLLM can function as a two-stage retriever without additional training: it bypasses lexical alignment layers and conditions representation generation on explicit priors to produce semantically faithful embeddings for fast candidate search, then applies neutral choice framing to reduce framing effects while using the model's reasoning for accurate reranking. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, this method substantially outperforms models trained on millions of pairs. The framework is model-agnostic, scales across families and sizes, preserves generative capabilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model.
What carries the argument
The FreeRet two-stage framework that derives semantically grounded embeddings by bypassing lexical alignment and conditioning on priors, followed by reasoning-based reranking with neutral choice framing.
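The two-stage shape described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the vectors are toy values standing in for MLLM-derived embeddings, and `rerank_score` is a hypothetical callable standing in for the model's reasoning-based relevance judgment.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(
    query_vec: Vector,
    cand_vecs: Sequence[Vector],
    rerank_score: Callable[[int], float],  # hypothetical stand-in for the MLLM reranker
    k: int,
) -> List[int]:
    """Return candidate indices: top-k by cosine, then reordered by rerank_score."""
    # Stage 1: fast candidate search over embeddings.
    by_similarity = sorted(
        range(len(cand_vecs)),
        key=lambda i: cosine(query_vec, cand_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:k]
    # Stage 2: reorder only the shortlist with the (more expensive) reranker.
    return sorted(shortlist, key=rerank_score, reverse=True)


# Toy data (assumed values, not results from the paper).
query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
toy_rerank_scores = {0: 0.2, 1: 0.9, 2: 0.7}

ranked = retrieve_then_rerank(query, candidates, toy_rerank_scores.get, k=2)
print(ranked)
```

In this toy run, candidate 1 has the highest reranker score but never reaches the reranker because stage 1 does not shortlist it. The reranker can only reorder what the embeddings retrieve, which is why first-stage embedding quality matters to the overall claim.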
Load-bearing premise
That off-the-shelf MLLMs already contain semantically faithful embeddings and reliable reasoning capabilities that can be directly harnessed for retrieval without any post-hoc training or alignment adjustments.
What would settle it
A new multimodal retrieval benchmark on which FreeRet underperforms models trained on large contrastive datasets, or where removing the reasoning reranking step causes a large drop in accuracy.
Original abstract
Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FreeRet, a plug-and-play, training-free framework that converts any off-the-shelf MLLM into a two-stage retriever. The first stage derives embeddings for fast candidate search by bypassing lexical alignment layers and conditioning on explicit priors; the second stage uses the MLLM's reasoning for reranking with neutral choice framing to mitigate framing effects. The approach is presented as model-agnostic, modality-flexible, and capable of unifying retrieval, reranking, and generation in a single model. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet is claimed to substantially outperform models trained on millions of pairs while preserving generative capabilities.
Significance. If the results hold under rigorous verification, the work would be significant for showing that pretrained MLLMs already encode retrieval-friendly representations that can be directly harnessed without contrastive fine-tuning. This could reduce the need for separate retrieval-specific training pipelines and support end-to-end RAG systems within unified multimodal models, with potential impact on generalist AI architectures.
major comments (2)
- [Experimental Evaluation] Experimental section: The headline claim of substantial outperformance on MMEB/MMEB-V2 lacks reported details on exact baselines (including their training data volume and architectures), statistical significance tests, error bars, or ablation on the contribution of each component (bypassing layers vs. priors vs. reranking). Without these, it is difficult to isolate whether gains stem from the proposed method or from implementation choices.
- [Embedding Derivation] Section describing the embedding stage: The assumption that bypassing lexical alignment layers produces embeddings whose cosine similarities reliably rank semantic relevance is load-bearing for the first-stage recall. No independent zero-shot retrieval metrics (e.g., recall@K on a held-out subset prior to reranking) are provided to validate embedding quality, leaving open the possibility that the reranker is compensating for a weak candidate pool.
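The first-stage check requested in the second major comment could look like the following. This is a minimal sketch, assuming rankings produced from embeddings alone (no reranking); the function names, query IDs, and document IDs are hypothetical and the numbers are toy values.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of one ranking."""
    if not relevant_ids:
        raise ValueError("relevant_ids must be non-empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall@k over all queries in `rankings`."""
    per_query = [recall_at_k(rankings[q], relevant[q], k) for q in rankings]
    return sum(per_query) / len(per_query)


# Toy first-stage rankings (assumed, embedding-only, before any reranking).
rankings = {
    "q1": ["d3", "d1", "d2"],
    "q2": ["d2", "d3", "d1"],
}
relevant = {"q1": {"d1"}, "q2": {"d1"}}

for k in (1, 2, 3):
    print(k, mean_recall_at_k(rankings, relevant, k))
```

Reporting this curve at the value of K actually passed to the reranker would show how much of the final accuracy is bounded by the candidate pool.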
minor comments (2)
- [Abstract] Abstract and introduction: Quantify the claimed 'substantial' improvements with specific metrics or relative gains rather than qualitative language.
- [Method] Clarify the precise formulation of 'explicit priors' and 'neutral choice framing' with pseudocode or a small example to aid reproducibility.
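As an indication of the level of detail the last comment asks for, here is one hypothetical reading of "neutral choice framing". The paper's actual prompt and scoring rule are not given in the material reviewed here, so everything below is an assumption: the prompt wording, the option labels A and B, and the idea of scoring relevance from the model's log-probabilities for the two labels.

```python
import math


def build_neutral_prompt(query: str, candidate: str) -> str:
    """Build a two-option prompt with neutral labels (hypothetical wording)."""
    return (
        f"Query: {query}\n"
        f"Candidate: {candidate}\n"
        "Which statement is correct?\n"
        "(A) The candidate satisfies the query.\n"
        "(B) The candidate does not satisfy the query.\n"
        "Answer with A or B."
    )


def relevance_from_logprobs(logp_a: float, logp_b: float) -> float:
    """Two-way softmax over the log-probabilities of the labels A and B.

    In this sketch the inputs would come from the MLLM's next-token
    distribution after the prompt; here they are plain floats.
    """
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    exp_a = math.exp(logp_a - m)
    exp_b = math.exp(logp_b - m)
    return exp_a / (exp_a + exp_b)


prompt = build_neutral_prompt("a red bicycle", "photo of a red bicycle by a wall")
score = relevance_from_logprobs(-1.0, -3.0)  # toy log-probabilities
print(prompt)
print(round(score, 4))
```

A specification at this granularity (exact prompt text, which tokens are scored, and how option order is handled) would make the reranking stage reproducible.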
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experimental details and validations as suggested.
- Referee: [Experimental Evaluation] Experimental section: The headline claim of substantial outperformance on MMEB/MMEB-V2 lacks reported details on exact baselines (including their training data volume and architectures), statistical significance tests, error bars, or ablation on the contribution of each component (bypassing layers vs. priors vs. reranking). Without these, it is difficult to isolate whether gains stem from the proposed method or from implementation choices.
  Authors: We agree that these details strengthen the presentation. In the revised manuscript, we have added a table specifying all baselines with their exact architectures and training data volumes. We now report results with error bars computed over three independent runs and include p-values from paired statistical significance tests against the strongest baselines. We have also expanded the ablation study to isolate the contributions of bypassing lexical alignment layers, explicit priors, and the reranking stage separately.
  Revision: yes
- Referee: [Embedding Derivation] Section describing the embedding stage: The assumption that bypassing lexical alignment layers produces embeddings whose cosine similarities reliably rank semantic relevance is load-bearing for the first-stage recall. No independent zero-shot retrieval metrics (e.g., recall@K on a held-out subset prior to reranking) are provided to validate embedding quality, leaving open the possibility that the reranker is compensating for a weak candidate pool.
  Authors: We acknowledge this concern. The revised manuscript now includes independent zero-shot retrieval metrics (recall@K at multiple K values) computed on held-out subsets using only the first-stage embeddings, prior to reranking. These results show that the embeddings achieve competitive initial recall, confirming that the reranker operates on a reasonably strong candidate pool rather than compensating for deficiencies in the embedding stage.
  Revision: yes
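The statistical reporting promised in the first response (error bars over three runs, paired significance tests) can be sketched as follows. The rebuttal does not name the test used; the exact sign-flip permutation test below is a swapped-in choice for illustration, and all scores are made-up toy values, not results from the paper.

```python
import itertools
import statistics
from typing import List, Tuple


def mean_and_std(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def sign_flip_p_value(diffs: List[float]) -> float:
    """Two-sided exact sign-flip test on paired per-dataset differences.

    Enumerates all 2^n sign assignments, so it is only practical for small n.
    """
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1.0, -1.0), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Toy numbers (assumed): one method's score over three runs.
run_scores = [70.0, 71.0, 72.0]
mean, std = mean_and_std(run_scores)

# Toy per-dataset scores (assumed) for a method and its strongest baseline.
method = [61.0, 72.0, 83.0, 94.0]
baseline = [60.0, 70.0, 80.0, 90.0]
diffs = [m - b for m, b in zip(method, baseline)]
p_value = sign_flip_p_value(diffs)

print(f"{mean:.1f} +/- {std:.1f}, p = {p_value}")
```

With only four paired datasets the smallest achievable two-sided p-value is 2/16 = 0.125, so the test cannot reach conventional significance even when every difference favors the method. Across the 46 datasets mentioned above, full enumeration is infeasible and a sampled version of the same test would be needed.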
Circularity Check
No significant circularity; empirical framework validated on benchmarks
The paper presents FreeRet as a plug-and-play, training-free method that derives embeddings by bypassing lexical alignment layers in off-the-shelf MLLMs and uses the model's reasoning for reranking. All central claims of outperformance are grounded in direct experimental results on the MMEB and MMEB-V2 benchmarks spanning 46 datasets, rather than any mathematical derivations, predictions, or first-principles results that reduce to the inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked in a load-bearing way that would create circularity. The approach is model-agnostic and empirically falsifiable, making the reported findings self-contained without tautological reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "bypassing the final MLP before the LM head... Removing it yields embeddings that better capture underlying meaning"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "On the MMEB and MMEB-V2 benchmarks... FreeRet substantially outperforms models trained on millions of pairs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Adapting MLLMs for Nuanced Video Retrieval
  Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.