FMA: A Dataset For Music Analysis

Micha\"el Defferrard , Kirell Benzi , Pierre Vandergheynst , Xavier Bresson

Authors on Pith no claims yet

classification 💻 cs.SD cs.IR

keywords audiodatasetmusiclargesomesuitabletasksaccessible

read the original abstract

We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
eess.AS 2026-05 unverdicted novelty 6.0

L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
cs.SD 2026-05 unverdicted novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
eess.AS 2026-04 unverdicted novelty 5.0

UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
eess.AS 2026-04 unverdicted novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
cs.SD 2026-04 unverdicted novelty 5.0

Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.
Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music
cs.SD 2026-04 unverdicted novelty 5.0

Mel-scale features exhibit measurable cultural bias with 12.5% higher WER on tonal languages and 15.7% F1 drop on non-Western music, while adaptive alternatives reduce these gaps substantially.
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
cs.SD 2026-04 unverdicted novelty 5.0

HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)
eess.AS 2026-04 unverdicted novelty 4.0

Detecting manners of articulation and adding them as knowledge features improves target speech extraction in cinematic audio with background sounds.