MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.
hub
Fma: A dataset for music analysis
18 Pith papers cite this work. Polarity classification is still indexing.
abstract
We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma
hub tools
citation-role summary
citation-polarity summary
representative citing papers
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
DALI dataset of 5358 tracks with aligned lyrics and notes is produced by iterative teacher-student singing-voice detection that refines web audio matches to initial karaoke annotations.
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
A black-box audio watermark removal attack trained on limited samples that generalizes across datasets and watermark schemes with high attack success rates.
CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yielding efficiency gains over AudioSep.
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.
Mel-scale features exhibit measurable cultural bias with 12.5% higher WER on tonal languages and 15.7% F1 drop on non-Western music, while adaptive alternatives reduce these gaps substantially.
HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
Detecting manners of articulation and adding them as knowledge features improves target speech extraction in cinematic audio with background sounds.
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
citing papers explorer
-
MusicDET: Zero-Shot AI-Generated Music Detection
MusicDET models the distribution of real music features with frequency-guided normalizing flows to detect AI-generated music as out-of-distribution samples in a zero-shot setting.
-
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
-
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm
DALI dataset of 5358 tracks with aligned lyrics and notes is produced by iterative teacher-student singing-voice detection that refines web audio matches to initial karaoke annotations.
-
Two-Dimensional Quantization for Geometry-Aware Audio Coding
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
-
HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal
A black-box audio watermark removal attack trained on limited samples that generalizes across datasets and watermark schemes with high attack success rates.
-
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yielding efficiency gains over AudioSep.
-
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
-
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
-
Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
-
UniPASE: A Generative Model for Universal Speech Enhancement with High Fidelity and Low Hallucinations
UniPASE extends the PASE framework with DeWavLM-Omni to convert degraded speech into high-fidelity, low-hallucination audio across sampling rates via phonetic enhancement, acoustic adaptation, and multi-rate vocoding.
-
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
-
Multimodal Dataset Normalization and Perceptual Validation for Music-Taste Correspondences
Music-flavor correspondences transfer from small human-annotated collections to large synthetic FMA datasets, with computational targets showing significant alignment to human listener ratings.
-
Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music
Mel-scale features exhibit measurable cultural bias with 12.5% higher WER on tonal languages and 15.7% F1 drop on non-Western music, while adaptive alternatives reduce these gaps substantially.
-
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
-
A Knowledge-Driven Approach to Target Speech Extraction in the Presence of Background Sound Effects for Cinematic Audio Source Separation (CASS)
Detecting manners of articulation and adding them as knowledge features improves target speech extraction in cinematic audio with background sounds.
-
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.