arxiv: 2605.18072 · v1 · pith:LMPC7F5Snew · submitted 2026-05-18 · 💻 cs.SD

MusicDET: Zero-Shot AI-Generated Music Detection

Pith reviewed 2026-05-20 00:27 UTC · model grok-4.3

classification 💻 cs.SD
keywords AI-generated music detection · zero-shot detection · normalizing flows · out-of-distribution detection · music generation · generative models · frequency-guided features

The pith

A model trained only on real music detects AI-generated tracks from new sources by measuring how unlikely each sample is under the learned real distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing detectors for AI-generated music lose accuracy when tested on generators they never saw during training because they rely on examples of fakes. This paper shows that a zero-shot alternative works by learning the probability distribution of real music features alone and then flagging low-likelihood inputs as generated. The method uses frequency-guided normalizing flows to build that distribution without ever seeing synthetic samples. A sympathetic reader would care because music generators improve quickly and any practical detector must handle unknown sources without constant retraining or new data collection.

Core claim

MusicDET learns the distribution of real music features with frequency-guided normalizing flows trained exclusively on authentic recordings. An input receives a low likelihood score if it falls outside this distribution and is therefore classified as AI-generated. The framework treats generated music from any source as out-of-distribution relative to the real-music model and reports stronger results than discriminative baselines on the FakeMusicCaps and SONICS datasets, especially for previously unseen generators.

What carries the argument

Frequency-guided normalizing flows that model the probability distribution of real music features, so that generated signals can be identified as out-of-distribution.

If this is right

  • Detection performance holds without any generated samples in the training set.
  • The method outperforms standard discriminative detectors on both seen and unseen generators.
  • Classification reduces to a simple likelihood threshold under the real-music distribution.
  • The approach scales to new generators without retraining or data collection.
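The likelihood-threshold rule in the list above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a 1-D Gaussian stands in for the learned flow density, and `calibrate_threshold`, `is_generated`, and the 5% false-positive rate are hypothetical names and values chosen for the example. The threshold is set using held-out real samples only, which matches the zero-shot setting.

```python
import math
import random
import statistics

def gaussian_log_likelihood(x, mean, std):
    """Log-density of x under a 1-D Gaussian (assumed stand-in for the flow's log p(x))."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def calibrate_threshold(real_scores, false_positive_rate=0.05):
    """Pick the score below which the chosen fraction of held-out real samples fall."""
    ordered = sorted(real_scores)
    index = int(false_positive_rate * len(ordered))
    return ordered[index]

def is_generated(score, threshold):
    """Flag an input as generated when its log-likelihood is below the threshold."""
    return score < threshold

rng = random.Random(0)
real_train = [rng.gauss(0.0, 1.0) for _ in range(2000)]    # toy "real" features
real_heldout = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # real-only calibration set
mean = statistics.fmean(real_train)
std = statistics.pstdev(real_train)

heldout_scores = [gaussian_log_likelihood(x, mean, std) for x in real_heldout]
threshold = calibrate_threshold(heldout_scores, false_positive_rate=0.05)

typical_score = gaussian_log_likelihood(0.1, mean, std)  # close to the real data
outlier_score = gaussian_log_likelihood(6.0, mean, std)  # far from the real data
print(is_generated(typical_score, threshold), is_generated(outlier_score, threshold))
```

No generated sample is needed to pick the threshold; the cost is that the false-positive rate on real music is fixed in advance, while the detection rate on any given generator is only known after evaluation.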

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same density-estimation strategy could extend to zero-shot detection of generated speech or video.
  • Tighter feature spaces or alternative flow architectures might further improve separation between real and synthetic distributions.
  • The work suggests distribution modeling offers a path to open-world detection that classification-based methods struggle to match.

Load-bearing premise

That frequency-guided normalizing flows can learn a distribution of real music features that is sufficiently tight and general to place music from unseen generators reliably outside that distribution.

What would settle it

A previously unseen music generator that produces tracks receiving high likelihood scores under the trained MusicDET model, comparable to real music.

Figures

Figures reproduced from arXiv: 2605.18072 by Chaolei Han, Hongsong Wang, Jie Gui.

Figure 1
Figure 1: Systematic discrepancies between real music and AI-generated music in terms of energy spectrograms. Real music exhibits coherent and well-organized time–frequency structures, whereas AI-generated music often displays irregular and less consistent spectral energy patterns.
Figure 2
Figure 2: Pipeline of MusicDET. In the zero-shot setting, MusicDET converts real-music waveforms into energy spectrograms and decomposes the resulting features into multiple frequency sub-bands via Frequency-Wise Decomposition. Each sub-band is processed by an independent Band-Wise Normalizing Flow, which performs invertible transformations to learn band-specific representations. The resulting latent codes are then …
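The Frequency-Wise Decomposition step named in the Figure 2 caption can be pictured with a short sketch. The caption does not say how the bands are chosen, so the contiguous, near-equal-width split below is an assumption, and `split_into_bands` is a hypothetical helper rather than a function from the paper.

```python
def split_into_bands(spectrogram, num_bands):
    """Split a [freq_bins][time_frames] spectrogram into contiguous frequency sub-bands."""
    num_bins = len(spectrogram)
    if not 1 <= num_bands <= num_bins:
        raise ValueError("num_bands must be between 1 and the number of frequency bins")
    base, extra = divmod(num_bins, num_bands)
    bands, start = [], 0
    for band_index in range(num_bands):
        # The first `extra` bands take one additional bin so every bin is used once.
        size = base + (1 if band_index < extra else 0)
        bands.append(spectrogram[start:start + size])
        start += size
    return bands

# Toy energy spectrogram: 10 frequency bins by 4 time frames.
toy = [[float(f * 10 + t) for t in range(4)] for f in range(10)]
bands = split_into_bands(toy, num_bands=3)
band_sizes = [len(band) for band in bands]
print(band_sizes)
```

In the pipeline described by the caption, each returned band would then be passed to its own band-wise flow; that part is not sketched here.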
Figure 3
Figure 3: Pipeline of class-conditional MusicDET. The normalizing flow learns class-conditional probability density functions for real music and AI-generated music via invertible transformations, enabling detection through likelihood estimation.
… music samples: $\min_{\theta} \mathbb{E}_{x \sim D_{\mathrm{real}}}\left[-\log p_X(x)\right]$
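The objective fragment beside Figure 3 (minimizing the expected negative log-likelihood over real music) can be made concrete with a toy flow. The sketch below assumes a single 1-D affine transformation with a standard normal prior, which is far simpler than the band-wise flows in the paper; `shift`, `log_scale`, the learning rate, and the synthetic data are all illustrative choices.

```python
import math
import random

def log_prob(x, shift, log_scale):
    """log p_X(x) by change of variables for z = (x - shift) * exp(-log_scale)."""
    z = (x - shift) * math.exp(-log_scale)
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # standard normal prior on z
    log_det = -log_scale                                      # log |dz/dx|
    return log_base + log_det

def mean_nll(data, shift, log_scale):
    """Average negative log-likelihood, the quantity being minimized."""
    return -sum(log_prob(x, shift, log_scale) for x in data) / len(data)

def fit(data, steps=1000, lr=0.05):
    """Plain gradient descent on the mean NLL with hand-derived gradients."""
    shift, log_scale = 0.0, 0.0
    for _ in range(steps):
        inv_scale = math.exp(-log_scale)
        grad_shift = grad_log_scale = 0.0
        for x in data:
            z = (x - shift) * inv_scale
            grad_shift += -z * inv_scale   # d(-log p)/d shift
            grad_log_scale += 1.0 - z * z  # d(-log p)/d log_scale
        shift -= lr * grad_shift / len(data)
        log_scale -= lr * grad_log_scale / len(data)
    return shift, log_scale

rng = random.Random(0)
data = [rng.gauss(3.0, 2.0) for _ in range(1000)]  # toy "real" features
nll_before = mean_nll(data, 0.0, 0.0)
shift, log_scale = fit(data)
nll_after = mean_nll(data, shift, log_scale)
print(nll_before > nll_after, round(shift, 2), round(math.exp(log_scale), 2))
```

Training on real data alone drives the parameters toward the data's location and spread, so samples far from that region receive low `log_prob` values; whether the same holds for high-dimensional music features is exactly the premise the referee report questions.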
Figure 4
Figure 4: Comparison of cross-generator generalization. We select W2V2-AASIST, MERT-AASIST, and SpecTTTra-α as representative countermeasures. Each model is trained on one of the five specific subsets of FakeMusicCaps and evaluated on all subsets. In each confusion matrix, the vertical axis denotes the training subset and the horizontal axis denotes the test subset.
Figure 5
Figure 5: Hyperparameter Analysis. (a) Effect of the number of frequency bands and the depth of the band-wise normalizing flows. (b) Effect of the prior mean μ_real in the normalizing flows.
Figure 6
Figure 6: Confusion matrices for reconstructed music with overall accuracy. (a) Spec-ResNet fails to transfer across reconstruction families. (b) Class-Conditional MusicDET achieves better generalization, where training on higher-quality reconstructions enables reliable detection on lower-quality reconstructions.
Figure 7
Figure 7: Visualization of MERT-AASIST and MusicDET predictions on real and AI-generated music samples. The left column corresponds to real samples, while the right column shows fake counterparts. For each sample, we visualize the raw waveform, the energy spectrum, and the corresponding model prediction results.
Abstract

Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MusicDET, a zero-shot framework for detecting AI-generated music. It trains frequency-guided normalizing flows exclusively on real music to model its feature distribution, then uses likelihood evaluation to flag generated samples as out-of-distribution. The central claim is that this generator-agnostic approach outperforms conventional discriminative detectors on the FakeMusicCaps and SONICS datasets, with particular gains for music from previously unseen generators.

Significance. If the results hold, the zero-shot formulation addresses a key practical limitation of existing detectors that require generated samples from specific models. A reliable density-based separation could enable more deployable systems for preserving musical authenticity. The use of normalizing flows for probabilistic modeling of real-music features is a reasonable technical choice, though its effectiveness on high-dimensional audio data remains to be demonstrated.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The claim that MusicDET 'consistently outperforms conventional discriminative detectors' is presented without any quantitative metrics, error bars, ablation results, dataset statistics, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains, which is load-bearing for the central empirical claim.
  2. [§3] §3 (Method), frequency-guided normalizing flow description: No details are given on flow depth, base distribution, conditioning mechanism, or regularization. Normalizing flows are known to assign high likelihood to some out-of-distribution inputs on complex, high-dimensional data with long-range temporal/spectral structure; without these specifics it is unclear whether the learned real-music manifold is tight enough to place unseen generator outputs in reliably low-likelihood regions, directly threatening the OOD detection premise.
minor comments (2)
  1. [§3] Notation for the likelihood threshold or decision rule is not explicitly defined; a clear equation would improve reproducibility.
  2. [§4] Dataset descriptions in §4 should include basic statistics (duration, sampling rate, number of tracks per split) to allow readers to judge the scope of the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments. We respond to each major comment point by point and indicate the revisions we will make.

  1. Referee: [Abstract and §4] The claim that MusicDET 'consistently outperforms conventional discriminative detectors' is presented without any quantitative metrics, error bars, ablation results, dataset statistics, or baseline numbers. This absence makes it impossible to assess the magnitude or statistical reliability of the reported gains, which is load-bearing for the central empirical claim.

    Authors: We acknowledge the need for more explicit quantitative support for our claims. In the revised manuscript, we will expand the abstract and §4 to include specific performance metrics from our experiments on FakeMusicCaps and SONICS, such as AUC scores and accuracy for MusicDET compared to conventional discriminative detectors. We will also add error bars from multiple runs, ablation results on key components, and dataset statistics including sample sizes and generator information. This will allow for a better evaluation of the reported gains, particularly for unseen generators. revision: yes

  2. Referee: [§3] No details are given on flow depth, base distribution, conditioning mechanism, or regularization. Normalizing flows are known to assign high likelihood to some out-of-distribution inputs on complex, high-dimensional data with long-range temporal/spectral structure; without these specifics it is unclear whether the learned real-music manifold is tight enough to place unseen generator outputs in reliably low-likelihood regions, directly threatening the OOD detection premise.

    Authors: We agree that more architectural details are required for reproducibility and to address concerns about density estimation. In the revised §3, we will provide specifics on the frequency-guided normalizing flow, including the number of layers, the choice of base distribution, how frequency information is used for conditioning, and the regularization methods employed. Additionally, we will include a discussion on how our approach handles high-dimensional audio data and why the learned distribution effectively separates OOD samples, supported by the empirical results. We believe this will clarify the tightness of the manifold for OOD detection. revision: yes
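The metrics requested in the first exchange above (AUC with error bars over multiple runs) could be reported along the following lines. This is a sketch under stated assumptions: the scores are made-up log-likelihoods, `auroc` and `summarize` are hypothetical helpers, and higher scores are taken to mean "more likely real".

```python
import statistics

def auroc(real_scores, fake_scores):
    """Probability that a random real sample outscores a random fake one (ties count half)."""
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

def summarize(aurocs):
    """Mean and sample standard deviation across independent runs."""
    return statistics.fmean(aurocs), statistics.stdev(aurocs)

# Toy log-likelihood scores for one run (illustrative values, not results from the paper).
real = [-1.0, -1.2, -0.8]
fake = [-3.0, -1.1, -4.0]
single_run = auroc(real, fake)

# Toy per-run AUROCs, as if the detector had been retrained with three seeds.
mean_auroc, std_auroc = summarize([0.90, 0.92, 0.88])
print(round(single_run, 3), round(mean_auroc, 3), round(std_auroc, 3))
```

Reporting the mean and spread per generator, rather than one pooled number, would make the unseen-generator claim easier to check.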

Circularity Check

0 steps flagged

No circularity: standard density-estimation OOD detection with no self-referential reductions

The paper formulates a zero-shot detection task and applies frequency-guided normalizing flows to learn a density over real-music features, then flags low-likelihood inputs as generated. This is a direct, non-circular use of existing normalizing-flow density estimation for out-of-distribution detection. No equations are shown that equate the detection score to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem or load-bearing premise, and the central claim does not rename a known empirical pattern or smuggle an ansatz via prior work. The derivation remains self-contained against external benchmarks for flow-based anomaly detection.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the modeling assumption that real music features admit a learnable probabilistic distribution separable from generated music; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Frequency-guided normalizing flows can accurately capture the distribution of real music features for out-of-distribution detection.
    Invoked as the basis for likelihood-based detection of generated samples.

pith-pipeline@v0.9.0 · 5680 in / 1187 out tokens · 41601 ms · 2026-05-20T00:27:00.749217+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. [1]

    AI-generated music detection and its challenges

    Afchar, D., Meseguer-Brocal, G., and Hennequin, R. AI-generated music detection and its challenges. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  2. [2]

    MusicLM: Generating Music From Text

    Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325,

  3. [3]

    VocBench: A neural vocoder benchmark for speech synthesis

    AlBadawy, E. A., Gibiansky, A., He, Q., Wu, J., Chang, M.-C., and Lyu, S. VocBench: A neural vocoder benchmark for speech synthesis. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 881–885. IEEE,

  4. [4]

    MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies

    Chen, K., Wu, Y., Liu, H., Nezhurina, M., Berg-Kirkpatrick, T., and Dubnov, S. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1206–1210. IEEE, 2024a. Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Ch...

  5. [5]

    Singing voice graph modeling for singfake detection

    Chen, X., Wu, H., Jang, R., and yi Lee, H. Singing voice graph modeling for singfake detection. In Interspeech 2024, pp. 4843–4847, 2024b. doi: 10.21437/Interspeech.2024-1185. Chiu, L.-L. and Lai, S.-H. Self-supervised normalizing flows for image anomaly detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern r...

  6. [6]

    FMA: A Dataset For Music Analysis

    URL https://arxiv.org/abs/1612.01840. Défossez, A., Copet, J., Synnaeve, G., and Adi, Y. High fidelity neural audio compression. Transactions on Machine Learning Research,

  7. [7]

    Stable audio open

    Evans, Z., Parker, J. D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  8. [8]

    Audio features investigation for singing voice deepfake detection

    Gohari, M., Salvi, D., Bestagini, P., and Adami, N. Audio features investigation for singing voice deepfake detection. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  9. [9]

    AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks

    Jung, J.-w., Heo, H.-S., Tak, H., Shim, H.-j., Chung, J. S., Lee, B.-J., Yu, H.-J., and Evans, N. AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6367–6371. IEEE,

  10. [10]

    SafeEar: Content privacy-preserving audio deepfake detection

    Li, X., Li, K., Zheng, Y., Yan, C., Ji, X., and Xu, W. SafeEar: Content privacy-preserving audio deepfake detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3585–3599, 2024a. Li, Y., Milling, M., Specia, L., and Schuller, B. W. From audio deepfake detection to AI-generated music detection–a path...

  11. [11]

    Mustango: Toward controllable text-to-music generation

    Melechovsky, J., Guo, Z., Ghosal, D., Majumder, N., Herremans, D., and Poria, S. Mustango: Toward controllable text-to-music generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8293–8316,

  12. [12]

    SpecAugment: A simple data augmentation method for automatic speech recognition

    Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, pp. 2613–2617,

  13. [13]

    InstructAudio: Unified speech and music generation with natural language instruction

    Qiang, C., Yin, K., Wang, X., Liang, Y., Zhao, J., Fu, R., Wang, T., Gong, C., Zhang, C., Wang, L., et al. InstructAudio: Unified speech and music generation with natural language instruction. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1...

  14. [14]

    Same same but differnet: Semi-supervised defect detection with normalizing flows

    Rudolph, M., Wandt, B., and Rosenhahn, B. Same same but differnet: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 1907–1916,

  15. [15]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

    Tak, H., Todisco, M., Wang, X., Jung, J.-w., Yamagishi, J., and Evans, N. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2022), pp. 112–119,

  16. [16]

    Todisco, M., Wang, X., Vestman, V., Sahidullah, M., Delgado, H., Nautsch, A., Yamagishi, J., Evans, N., Kinnunen, T., and Lee, K. A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. In Proceedings of Interspeech, pp. 1008–1012,

  17. [17]

    AWaveFormer: Audio wavelet transformer network for generalized audio deepfake detection

    Wang, R., Chen, Z., Wang, B., Ba, Z., and Ren, K. AWaveFormer: Audio wavelet transformer network for generalized audio deepfake detection. IEEE Transactions on Audio, Speech and Language Processing, 2025a. Wang, Z., Ye, D., Li, J., and Deng, J. Generalize audio deepfake algorithm recognition via attribution enhancement. In ICASSP 2025 - 2025 IEEE Intern...

  18. [18]

    Xu, Z., Dutta, D., Wei, Y.-L., and Choudhury, R. R. Multi-source music generation with latent diffusion. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation,

  19. [19]

    A robust audio deepfake detection system via multi-view feature

    Yang, Y., Qin, H., Zhou, H., Wang, C., Guo, T., Han, K., and Wang, Y. A robust audio deepfake detection system via multi-view feature. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13131–13135. IEEE,

  20. [20]

    Normalizing flows are capable generative models

    doi: 10.21437/Interspeech.2024-2242. Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. Á., Jaitly, N., and Susskind, J. M. Normalizing flows are capable generative models. In Forty-second International Conference on Machine Learning, ICML,

  21. [21]

    Diffusion based text-to-music generation with global and local text based conditioning

    Zhang, J., Parada, P. P., Jalal, M. A., and Saravanan, K. Diffusion based text-to-music generation with global and local text based conditioning. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  22. [22]

    Appendix A.1

    A. Appendix A.1. Preliminaries of Normalizing Flows. Normalizing flows are likelihood-based generative models that enable exact density estimation via a sequence of invertible transformations between the data space and a latent space with a tractable prior distribution. By explicitly modeling the dist...

  23. [23]

    Each reconstructed sample is paired with its corresponding original, forming a real vs. reconstructed discrimination task

    at bitrates of 3, 6, and 24 kbps, and (ii) GrifMel (AlBadawy et al., 2022), a Griffin–Lim-based mel inversion pipeline with 256 and 512 mel bins. Each reconstructed sample is paired with its corresponding original, forming a real vs. reconstructed discrimination task. We then train and evaluate class-conditional MusicDET (a) under within-family transfer...

  24. [24]

    In the within-family setting, both methods transfer well across EnCodec operating points and GrifMel mel bins

    and MusicDET under cross-generator evaluation, where the y-axis denotes the training subset and the x-axis denotes the test subset. In the within-family setting, both methods transfer well across EnCodec operating points and GrifMel mel bins. In the cross-family setting, Spec-ResNet generalizes poorly beyond its training family. When trained on EnCodec, ou...