Decoupled weight decay regularization
2 Pith papers cite this work.
citing papers explorer
- Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio
A unified Transformer model with modality-specific tokenization, trained on a new 1300-hour multimodal music dataset, outperforms single-task baselines on optical music recognition and other translations while achieving the first score-image-conditioned audio generation.
- Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval
Caption-Matching generates image captions with pre-trained vision-language models (VLMs) and matches them across domains, achieving state-of-the-art cross-domain image retrieval (CDIR) performance on Office-Home and DomainNet without labeled data or fine-tuning.