MusicDET: Zero-Shot AI-Generated Music Detection
Pith reviewed 2026-05-20 00:27 UTC · model grok-4.3
The pith
A model trained only on real music detects AI-generated tracks from new sources by measuring how unlikely each sample is under the learned real distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MusicDET learns the distribution of real music features with frequency-guided normalizing flows trained exclusively on authentic recordings. An input receives a low likelihood score if it falls outside this distribution and is therefore classified as AI-generated. The framework treats generated music from any source as out-of-distribution relative to the real-music model and reports stronger results than discriminative baselines on the FakeMusicCaps and SONICS datasets, especially for previously unseen generators.
What carries the argument
Frequency-guided normalizing flows that model the probability distribution of real music features, so that generated signals can be identified as out-of-distribution.
If this is right
- Detection performance holds without any generated samples in the training set.
- The method outperforms standard discriminative detectors on both seen and unseen generators.
- Classification reduces to a simple likelihood threshold under the real-music distribution.
- The approach scales to new generators without retraining or data collection.
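The likelihood-threshold idea in the list above can be made concrete with a minimal sketch. It is not the paper's implementation: a per-dimension Gaussian fitted to real features stands in for the frequency-guided flow, the feature vectors are synthetic, and the names `fit_real_density`, `log_likelihood`, `pick_threshold`, and `sample` are hypothetical. The threshold is chosen from real data only (here, roughly the 5th percentile of real training scores), which keeps the setup zero-shot.

```python
import math
import random
import statistics

def fit_real_density(feats):
    """Fit a per-dimension Gaussian to real features (stand-in for the flow)."""
    dims = list(zip(*feats))
    mu = [statistics.fmean(d) for d in dims]
    var = [statistics.pvariance(d) + 1e-6 for d in dims]  # floor avoids zero variance
    return mu, var

def log_likelihood(x, mu, var):
    """Log density of one feature vector under the diagonal Gaussian."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mu, var)
    )

def pick_threshold(real_scores, fpr=0.05):
    """Score below which roughly a fraction `fpr` of real samples fall."""
    ranked = sorted(real_scores)
    return ranked[int(fpr * len(ranked))]

def sample(rng, n, dim, shift):
    """Synthetic feature vectors; `shift` moves the mean away from the real data."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
real_train = sample(rng, 2000, 16, 0.0)
real_test = sample(rng, 500, 16, 0.0)
fake_test = sample(rng, 500, 16, 1.5)  # toy "generated" data, shifted on purpose

mu, var = fit_real_density(real_train)
tau = pick_threshold([log_likelihood(x, mu, var) for x in real_train])

# A sample is flagged as generated when its log-likelihood falls below tau.
real_flag_rate = sum(log_likelihood(x, mu, var) < tau for x in real_test) / len(real_test)
fake_flag_rate = sum(log_likelihood(x, mu, var) < tau for x in fake_test) / len(fake_test)
print(f"flagged real: {real_flag_rate:.2f}, flagged generated: {fake_flag_rate:.2f}")
```

The toy separation works only because the synthetic "generated" features are shifted away from the real ones. Whether real generators land in low-likelihood regions is exactly the premise the analysis below questions.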
Where Pith is reading between the lines
- The same density-estimation strategy could extend to zero-shot detection of generated speech or video.
- Tighter feature spaces or alternative flow architectures might further improve separation between real and synthetic distributions.
- The work suggests distribution modeling offers a path to open-world detection that classification-based methods struggle to match.
Load-bearing premise
That frequency-guided normalizing flows can learn a distribution of real music features that is sufficiently tight and general to place music from unseen generators reliably outside that distribution.
What would settle it
A previously unseen music generator that produces tracks receiving high likelihood scores under the trained MusicDET model, comparable to real music.
Original abstract
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MusicDET, a zero-shot framework for detecting AI-generated music. It trains frequency-guided normalizing flows exclusively on real music to model its feature distribution, then uses likelihood evaluation to flag generated samples as out-of-distribution. The central claim is that this generator-agnostic approach outperforms conventional discriminative detectors on the FakeMusicCaps and SONICS datasets, with particular gains for music from previously unseen generators.
Significance. If the results hold, the zero-shot formulation addresses a key practical limitation of existing detectors that require generated samples from specific models. A reliable density-based separation could enable more deployable systems for preserving musical authenticity. The use of normalizing flows for probabilistic modeling of real-music features is a reasonable technical choice, though its effectiveness on high-dimensional audio data remains to be demonstrated.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The claim that MusicDET 'consistently outperforms conventional discriminative detectors' is presented without any quantitative metrics, error bars, ablation results, dataset statistics, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains, which is load-bearing for the central empirical claim.
- [§3] §3 (Method), frequency-guided normalizing flow description: No details are given on flow depth, base distribution, conditioning mechanism, or regularization. Normalizing flows are known to underestimate density on complex, high-dimensional data with long-range temporal/spectral structure; without these specifics it is unclear whether the learned real-music manifold is tight enough to place unseen generator outputs in reliably low-likelihood regions, directly threatening the OOD detection premise.
minor comments (2)
- [§3] Notation for the likelihood threshold or decision rule is not explicitly defined; a clear equation would improve reproducibility.
- [§4] Dataset descriptions in §4 should include basic statistics (duration, sampling rate, number of tracks per split) to allow readers to judge the scope of the evaluation.
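As an illustration of the kind of equation the first minor comment asks for, one possible form of the decision rule is given below. This is an assumption, not the paper's notation: $\log p_X(x)$ is the log-likelihood of input $x$ under the real-music model, and $\tau$ is a hypothetical threshold chosen on held-out real music only.

```latex
\hat{y}(x) =
\begin{cases}
\text{generated}, & \log p_X(x) < \tau, \\
\text{real}, & \log p_X(x) \ge \tau.
\end{cases}
```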
Simulated Author's Rebuttal
We thank the referee for their valuable comments. We respond to each major comment point by point and indicate the revisions we will make.
- Referee: [Abstract and §4] The claim that MusicDET 'consistently outperforms conventional discriminative detectors' is presented without any quantitative metrics, error bars, ablation results, dataset statistics, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains, which is load-bearing for the central empirical claim.
Authors: We acknowledge the need for more explicit quantitative support for our claims. In the revised manuscript, we will expand the abstract and §4 to include specific performance metrics from our experiments on FakeMusicCaps and SONICS, such as AUC scores and accuracy for MusicDET compared to conventional discriminative detectors. We will also add error bars from multiple runs, ablation results on key components, and dataset statistics including sample sizes and generator information. This will allow for a better evaluation of the reported gains, particularly for unseen generators.
Revision: yes
- Referee: [§3] No details are given on flow depth, base distribution, conditioning mechanism, or regularization. Normalizing flows are known to underestimate density on complex, high-dimensional data with long-range temporal/spectral structure; without these specifics it is unclear whether the learned real-music manifold is tight enough to place unseen generator outputs in reliably low-likelihood regions, directly threatening the OOD detection premise.
Authors: We agree that more architectural details are required for reproducibility and to address concerns about density estimation. In the revised §3, we will provide specifics on the frequency-guided normalizing flow, including the number of layers, the choice of base distribution, how frequency information is used for conditioning, and the regularization methods employed. Additionally, we will include a discussion on how our approach handles high-dimensional audio data and why the learned distribution effectively separates OOD samples, supported by the empirical results. We believe this will clarify the tightness of the manifold for OOD detection.
Revision: yes
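For readers unfamiliar with the AUC metric the authors promise to report, a minimal sketch of how it could be computed from likelihood scores follows. The function name `auc_from_scores` and the example scores are hypothetical, and the convention that real music should receive the higher log-likelihood is an assumption that matches the detection rule described above.

```python
def auc_from_scores(real_scores, fake_scores):
    """Fraction of (real, generated) pairs in which the real sample scores higher.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve when real music is treated as the positive class.
    """
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

# Made-up log-likelihood scores, for illustration only.
real = [-10.2, -11.0, -9.5, -12.1]
fake = [-15.3, -11.5, -20.0]
auc = auc_from_scores(real, fake)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance and 1.0 to perfect separation. Because the metric does not depend on a threshold, it complements a thresholded accuracy figure.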
Circularity Check
No circularity: standard density-estimation OOD detection with no self-referential reductions
The paper formulates a zero-shot detection task and applies frequency-guided normalizing flows to learn a density over real-music features, then flags low-likelihood inputs as generated. This is a direct, non-circular use of existing normalizing-flow density estimation for out-of-distribution detection. No equations are shown that equate the detection score to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and the central claim does not rename a known empirical pattern or smuggle an ansatz via prior work. The derivation remains self-contained against external benchmarks for flow-based anomaly detection.
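The density estimation described here rests on the change-of-variables rule for normalizing flows: the log-likelihood of an input is the base log-density of its transformed value plus the log-determinants of each layer's Jacobian. The sketch below shows that computation for a toy stack of elementwise affine layers with a Gaussian base of identity covariance. The layers, their parameters, and the names `gaussian_logpdf` and `flow_log_likelihood` are hypothetical; this is not the paper's frequency-guided architecture.

```python
import math

def gaussian_logpdf(z, mean):
    """Log density of N(mean, I) evaluated at vector z."""
    return sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * (zi - m) ** 2
        for zi, m in zip(z, mean)
    )

def flow_log_likelihood(x, layers, base_mean):
    """Exact log p_X(x) for a stack of elementwise affine layers.

    Each layer is (scale, shift) and maps h -> scale * h + shift per dimension,
    so its log |det J| is the sum of log |scale| over dimensions.
    """
    h = list(x)
    log_det_total = 0.0
    for scale, shift in layers:
        h = [s * hi + b for s, hi, b in zip(scale, h, shift)]
        log_det_total += sum(math.log(abs(s)) for s in scale)
    return gaussian_logpdf(h, base_mean) + log_det_total

# Two hypothetical layers acting on 2-D features.
layers = [([2.0, 0.5], [0.1, -0.3]), ([1.5, 3.0], [0.0, 0.2])]
base_mean = [0.0, 1.0]

near = flow_log_likelihood([0.0, 0.5], layers, base_mean)   # maps close to the base mean
far = flow_log_likelihood([4.0, -4.0], layers, base_mean)   # maps far from the base mean
print(near, far)
```

Inputs that map far from the base mean receive a much lower log-likelihood, which is the quantity a threshold-based detector would compare against its cutoff.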
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Frequency-guided normalizing flows can accurately capture the distribution of real music features for out-of-distribution detection.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
frequency-guided normalizing flows that probabilistically models the distribution of real music features... $\log p_X(x) = \log p_Z(h_K) + \sum_j \log |\det J_{f_j}|$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Band-Wise Normalizing Flows... Global Normalizing Flow... Gaussian prior $p_Z(z) = \mathcal{N}(\mu_{\text{real}}, I)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Afchar, D., Meseguer-Brocal, G., and Hennequin, R. AI-generated music detection and its challenges. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
- [2] Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325.
- [3] AlBadawy, E. A., Gibiansky, A., He, Q., Wu, J., Chang, M.-C., and Lyu, S. VocBench: A neural vocoder benchmark for speech synthesis. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 881–885. IEEE, 2022.
- [4] Chen, K., Wu, Y., Liu, H., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1206–1210. IEEE, 2024a. Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Ch...
- [5] Chen, X., Wu, H., Jang, R., and yi Lee, H. Singing voice graph modeling for singfake detection. In Interspeech 2024, pp. 4843–4847, 2024b. doi: 10.21437/Interspeech.2024-1185. Chiu, L.-L. and Lai, S.-H. Self-supervised normalizing flows for image anomaly detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern r...
- [6] FMA: A Dataset For Music Analysis. URL https://arxiv.org/abs/1612.01840. Défossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. Transactions on Machine Learning Research.
- [7] Evans, Z., Parker, J. D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
- [8] Gohari, M., Salvi, D., Bestagini, P., and Adami, N. Audio features investigation for singing voice deepfake detection. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
- [9] Jung, J.-w., Heo, H.-S., Tak, H., Shim, H.-j., Chung, J. S., Lee, B.-J., Yu, H.-J., and Evans, N. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6367–6371. IEEE, 2022.
- [10] Li, X., Li, K., Zheng, Y., Yan, C., Ji, X., and Xu, W. SafeEar: Content privacy-preserving audio deepfake detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3585–3599, 2024a. Li, Y., Milling, M., Specia, L., and Schuller, B. W. From audio deepfake detection to AI-generated music detection–a path...
- [11] Melechovsky, J., Guo, Z., Ghosal, D., Majumder, N., Herremans, D., and Poria, S. Mustango: Toward controllable text-to-music generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8293–8316, 2024.
- [12] Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, pp. 2613–2617, 2019.
- [13] Qiang, C., Yin, K., Wang, X., Liang, Y., Zhao, J., Fu, R., Wang, T., Gong, C., Zhang, C., Wang, L., et al. InstructAudio: Unified speech and music generation with natural language instruction. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1...
- [14] Rudolph, M., Wandt, B., and Rosenhahn, B. Same same but differnet: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1907–1916.
- [15] Tak, H., Todisco, M., Wang, X., Jung, J.-w., Yamagishi, J., and Evans, N. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022), pp. 112–119, 2022.
- [16] Todisco, M., Wang, X., Vestman, V., Sahidullah, M., Delgado, H., Nautsch, A., Yamagishi, J., Evans, N., Kinnunen, T., and Lee, K. A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. In Proceedings of Interspeech, pp. 1008–1012, 2019.
- [17] Wang, R., Chen, Z., Wang, B., Ba, Z., and Ren, K. AWaveFormer: Audio wavelet transformer network for generalized audio deepfake detection. IEEE Transactions on Audio, Speech and Language Processing, 2025a. Wang, Z., Ye, D., Li, J., and Deng, J. Generalize audio deepfake algorithm recognition via attribution enhancement. In ICASSP 2025-2025 IEEE Intern...
- [18] Xu, Z., Dutta, D., Wei, Y.-L., and Choudhury, R. R. Multi-source music generation with latent diffusion. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024.
- [19] Yang, Y., Qin, H., Zhou, H., Wang, C., Guo, T., Han, K., and Wang, Y. A robust audio deepfake detection system via multi-view feature. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13131–13135. IEEE, 2024.
- [20] doi: 10.21437/Interspeech.2024-2242. Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. Á., Jaitly, N., and Susskind, J. M. Normalizing flows are capable generative models. In Forty-second International Conference on Machine Learning, ICML.
- [21] Zhang, J., Parada, P. P., Jalal, M. A., and Saravanan, K. Diffusion based text-to-music generation with global and local text based conditioning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
- [22] A. Appendix. A.1. Preliminaries of Normalizing Flows. Normalizing flows are likelihood-based generative models that enable exact density estimation via a sequence of invertible transformations between the data space and a latent space with a tractable prior distribution. By explicitly modeling the dist...
- [23] at bitrates of 3, 6, and 24 kbps, and (ii) GrifMel (AlBadawy et al., 2022), a Griffin–Lim-based mel inversion pipeline with 256 and 512 mel bins. Each reconstructed sample is paired with its corresponding original, forming a real vs. reconstructed discrimination task. We then train and evaluate class-conditional MusicDET (a) under within-family transfer...
- [24] and MusicDET under cross-generator evaluation, where the y-axis denotes the training subset and the x-axis denotes the test subset. In the within-family setting, both methods transfer well across EnCodec operating points and GrifMel mel bins. In the cross-family setting, Spec-ResNet generalizes poorly beyond its training family. When trained on EnCodec, ou...