Omni-embed-nemotron: A unified multimodal retrieval model for text, image, audio, and video

Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer · 2025 · arXiv 2510.03458

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

cs.AI · 2026-05-05 · unverdicted · novelty 8.0 · 2 refs

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

Multimodal Data Curation Through Ranked Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 7.0

Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

cs.SD · 2026-04-22 · unverdicted · novelty 7.0

Defines the ATIR task and benchmark for mixed audio-text queries; an MLLM-based model with token compression shows substantial gains over strong baselines.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

cs.SD · 2026-04-20 · unverdicted · novelty 6.0

Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

citing papers explorer

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval · cs.AI · 2026-05-05 · unverdicted · none · ref 32 · 2 links
Multimodal Data Curation Through Ranked Retrieval · cs.IR · 2026-05-01 · unverdicted · none · ref 15
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models · cs.IR · 2026-04-25 · unverdicted · none · ref 38
ATIR: Towards Audio-Text Interleaved Contextual Retrieval · cs.SD · 2026-04-22 · unverdicted · none · ref 57
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise · cs.IR · 2026-02-13 · unverdicted · none · ref 42
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval · cs.SD · 2026-04-20 · unverdicted · none · ref 26
