ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
Omni-embed-nemotron: A unified multimodal retrieval model for text, image, audio, and video
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6verdicts
UNVERDICTED 6representative citing papers
Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
citing papers explorer
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
-
Multimodal Data Curation Through Ranked Retrieval
Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
-
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.