Performance MIDI-to-score conversion by neural beat tracking
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.SD (1)
Years: 2025 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers:
Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio
A unified Transformer model with modality-specific tokenization, trained on a new 1300-hour multimodal music dataset, outperforms single-task baselines on optical music recognition and other translations while achieving the first score-image-conditioned audio generation.