FMA: A Dataset for Music Analysis

2016 · cs.SD · arXiv 1612.01840

18 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma

citation-role summary

background: 3 · dataset: 1

citation-polarity summary

background: 3 · use dataset: 1

representative citing papers

MusicDET: Zero-Shot AI-Generated Music Detection

cs.SD · 2026-05-18 · unverdicted · novelty 7.0

MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

cs.SD · 2025-09-28 · unverdicted · novelty 7.0

AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm

eess.AS · 2019-06-25 · unverdicted · novelty 7.0

DALI dataset of 5358 tracks with aligned lyrics and notes is produced by iterative teacher-student singing-voice detection that refines web audio matches to initial karaoke annotations.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

cs.SD · 2025-11-26 · conditional · novelty 6.0

A black-box audio watermark removal attack trained on limited samples that generalizes across datasets and watermark schemes with high attack success rates.

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

cs.SD · 2025-09-15 · unverdicted · novelty 6.0

CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yielding efficiency gains over AudioSep.

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

eess.AS · 2026-05-09 · unverdicted · novelty 6.0

L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

cs.SD · 2025-02-06 · unverdicted · novelty 5.0

XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

cs.SD · 2026-05-01 · unverdicted · novelty 5.0

A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.

UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations

eess.AS · 2026-04-16 · unverdicted · novelty 5.0

UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

eess.AS · 2026-04-14 · unverdicted · novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences

cs.SD · 2026-04-12 · unverdicted · novelty 5.0

Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

cs.SD · 2026-04-12 · unverdicted · novelty 5.0

Mel-scale features exhibit measurable cultural bias with 12.5% higher WER on tonal languages and 15.7% F1 drop on non-Western music, while adaptive alternatives reduce these gaps substantially.

HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation

cs.SD · 2026-04-10 · unverdicted · novelty 5.0

HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.

A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)

eess.AS · 2026-04-30 · unverdicted · novelty 4.0

Detecting manners of articulation and adding them as knowledge features improves target speech extraction in cinematic audio with background sounds.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

eess.AS · 2026-05-15 · unverdicted · novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
