In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

CLAP: Learning Audio Concepts from Natural Language Supervision · 2023

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

cs.MM · 2026-04-16 · unverdicted · novelty 6.0

ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.

LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

cs.SD · 2026-04-13 · unverdicted · novelty 6.0

LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

cs.AI · 2026-04-13 · unverdicted · novelty 6.0 · 2 refs

EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

cs.IR · 2026-02-10 · unverdicted · novelty 5.0

TASTE dataset and MuQ-token aggregation enable effective use of audio features from large music models to improve content-based music recommendations over collaborative filtering alone.

citing papers explorer

Showing 4 of 4 citing papers.

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
cs.MM · 2026-04-16 · unverdicted · none · ref 13

LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
cs.SD · 2026-04-13 · unverdicted · none · ref 13

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
cs.AI · 2026-04-13 · unverdicted · none · ref 13 · 2 links

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models
cs.IR · 2026-02-10 · unverdicted · none · ref 7
