Title resolution pending

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

cs.AI · 2026-05-05 · unverdicted · novelty 8.0 · 2 refs

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

cs.MM · 2026-04-16 · unverdicted · novelty 7.0

Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text performance unchanged.

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

cs.MM · 2026-04-16 · unverdicted · novelty 6.0

ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.

LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

cs.SD · 2026-04-13 · unverdicted · novelty 6.0

LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

cs.AI · 2026-04-13 · unverdicted · novelty 6.0 · 2 refs

EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

cs.IR · 2026-02-10 · unverdicted · novelty 5.0

TASTE dataset and MuQ-token aggregation enable effective use of audio features from large music models to improve content-based music recommendations over collaborative filtering alone.

citing papers explorer

Showing 8 of 8 citing papers.

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval cs.AI · 2026-05-05 · unverdicted · none · ref 12 · 2 links
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery cs.MM · 2026-04-16 · unverdicted · none · ref 15
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise cs.IR · 2026-02-13 · unverdicted · none · ref 9
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers cs.CL · 2026-05-08 · unverdicted · none · ref 10 · 2 links
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text performance unchanged.
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling cs.MM · 2026-04-16 · unverdicted · none · ref 12
ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation cs.SD · 2026-04-13 · unverdicted · none · ref 12
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.
EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models cs.AI · 2026-04-13 · unverdicted · none · ref 12 · 2 links
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models cs.IR · 2026-02-10 · unverdicted · none · ref 6
TASTE dataset and MuQ-token aggregation enable effective use of audio features from large music models to improve content-based music recommendations over collaborative filtering alone.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer