ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
Pith reviewed 2026-05-10 07:17 UTC · model grok-4.3
The pith
AI-generated music carries detectable physical artifacts from neural codecs that a compact network can extract for reliable identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ArtifactNet reframes AI music detection as the extraction and analysis of physical artifacts that neural audio codecs imprint on generated audio. A bounded-mask UNet extracts codec residuals from magnitude spectrograms, which are decomposed via HPSS into 7-channel forensic features and classified by a compact CNN. On the unseen test partition of ArtifactBench, it achieves F1 = 0.9829 with a low false positive rate, outperforming larger representation-learning baselines while using 49 times fewer parameters than CLAM and 4.8 times fewer than SpecTTTra. Codec-aware training with augmentation across WAV, MP3, AAC, and Opus reduces cross-codec probability drift by 83 percent.
What carries the argument
The bounded-mask UNet (ArtifactUNet) that extracts codec residuals from magnitude spectrograms for subsequent HPSS decomposition into forensic features.
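The HPSS stage that this pipeline relies on can be sketched with the median-filtering formulation of Fitzgerald (2010), which the paper cites. The sketch below assumes the residual spectrogram has already been extracted by ArtifactUNet; the 7-channel layout shown is a guess for illustration, since the paper's exact channel recipe is not given in this review.

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_median(S, kernel=17):
    """Harmonic/percussive separation by median filtering (Fitzgerald, 2010).

    S: non-negative magnitude spectrogram, shape (freq_bins, frames).
    Harmonic energy is smooth across time (filter along frames);
    percussive energy is smooth across frequency (filter along bins).
    """
    H = median_filter(S, size=(1, kernel))   # suppresses percussive spikes
    P = median_filter(S, size=(kernel, 1))   # suppresses harmonic ridges
    eps = 1e-10                              # avoids 0/0 in the soft masks
    mask_h = H**2 / (H**2 + P**2 + eps)
    mask_p = P**2 / (H**2 + P**2 + eps)
    return S * mask_h, S * mask_p

def forensic_channels(residual, original):
    """Illustrative 7-channel stack; the channel recipe here is assumed,
    not taken from the paper."""
    h_r, p_r = hpss_median(residual)
    h_o, p_o = hpss_median(original)
    return np.stack([
        residual, h_r, p_r,           # residual and its HPSS parts
        original, h_o, p_o,           # original and its HPSS parts
        np.abs(original - residual),  # difference map
    ])                                # shape (7, freq_bins, frames)
```

A compact CNN would then consume the `(7, F, T)` stack as an image-like input.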
If this is right
- Detection becomes feasible with models small enough for edge or real-time deployment.
- Performance holds across different audio codecs due to the targeted augmentation strategy.
- Generalization to unseen generators improves by focusing on physical codec traces instead of learned representations.
- A multi-generator benchmark enables fair zero-shot testing of future detectors.
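The codec-invariance claim above rests on a cross-codec probability drift metric that this summary does not define. A plausible reading, sketched below under that assumption, is the per-track spread of the detector's AI-probability across codec renderings of the same track.

```python
import numpy as np

def cross_codec_drift(probs_by_codec):
    """Mean per-track spread of detector probabilities across codecs.

    probs_by_codec: dict codec_name -> sequence of AI-probabilities, one
    per track, aligned across codecs. Drift for a track is the max - min
    of its probabilities over codecs; the mean over tracks is returned.
    NOTE: an assumed definition -- the paper's exact Delta metric is not
    specified in this review.
    """
    mat = np.stack([np.asarray(p, dtype=float)
                    for p in probs_by_codec.values()])  # (codecs, tracks)
    return float(np.mean(mat.max(axis=0) - mat.min(axis=0)))
```

Under this reading, the reported improvement corresponds to the drift falling from about 0.95 to 0.16 after 4-way codec augmentation.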
Where Pith is reading between the lines
- The forensic residual approach could extend to detecting synthetic speech or other audio by isolating similar generation signatures.
- It suggests that codec-level artifacts may remain a reliable signal even as generative models evolve and diversify.
- Real-world deployment would benefit from testing on mixed or edited audio that combines AI and human sources.
Load-bearing premise
The extracted codec residuals are caused by the AI generation process itself rather than by shared post-processing, real-world recording artifacts, or the specific training distributions of the generators in the benchmark.
What would settle it
A test set of AI-generated tracks from new generators that avoid standard neural codecs or apply heavy post-processing, or a set of real tracks processed identically, would show whether the high detection accuracy and low false positives hold.
Original abstract
We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics -- extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms, which are then decomposed via HPSS into 7-channel forensic features for classification by a compact CNN (0.4M parameters; 4.0M total). We introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI from 22 generators and 1,800 real from 6 diverse sources). Each track is tagged with bench_origin for fair zero-shot evaluation. On the unseen test partition (n=2,263), ArtifactNet achieves F1 = 0.9829 with FPR = 1.49%, compared to CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%) evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Delta = 0.95 -> 0.16), resolving the primary codec-invariance failure mode. These results establish forensic physics -- direct extraction of codec-level artifacts -- as a more generalizable and parameter-efficient paradigm for AI music detection than representation learning, using 49x fewer parameters than CLAM and 4.8x fewer than SpecTTTra.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ArtifactNet, a lightweight framework for detecting AI-generated music by extracting codec-level forensic residuals. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) processes magnitude spectrograms to isolate residuals, which are decomposed via HPSS into 7-channel features and classified by a compact CNN (0.4M parameters; 4M total). ArtifactBench is presented as a benchmark with 6,183 tracks (4,383 AI from 22 generators, 1,800 real from 6 sources), each tagged with bench_origin for zero-shot evaluation. On the unseen test partition (n=2,263), the model reports F1=0.9829 and FPR=1.49%, outperforming CLAM (F1=0.7576) and SpecTTTra (F1=0.7713) under identical conditions while using 49x and 4.8x fewer parameters, respectively. Codec-aware 4-way augmentation reduces cross-codec drift by 83%.
Significance. If the central claims hold, this establishes forensic residual extraction as a more parameter-efficient and generalizable paradigm than representation learning for AI music detection. The concrete empirical results (F1, FPR, drift reduction) on a held-out multi-generator benchmark, direct baseline comparisons, and total model size of 4M parameters are strengths that could influence practical deployment. The benchmark itself is a useful contribution for the field.
major comments (3)
- [§4.2] §4.2 (Zero-shot Evaluation and Benchmark): The unseen partition (n=2,263) is described as fair via bench_origin tags, but the manuscript does not explicitly confirm that entire generators are held out (as opposed to individual tracks). Without this, the F1=0.9829 may reflect leakage of generator-specific traits rather than generalizable forensic physics.
- [§5] §5 (Experimental Results): No control experiment is reported in which real tracks are passed through the same neural codecs (WAV/MP3/AAC/Opus) used by the 22 generators to verify that extracted residuals are caused by AI generation physics rather than shared post-processing or recording chains present in ArtifactBench.
- [§3.2] §3.2 (HPSS Decomposition): The contribution of the HPSS step to the 7-channel forensic features is not ablated, leaving open whether the performance gains derive from the bounded-mask UNet residuals or from the subsequent decomposition.
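The first major comment above hinges on whether the split holds out entire generators rather than individual tracks. A generator-level hold-out can be sketched as follows; the track schema (a `generator` key) is hypothetical, since ArtifactBench's actual fields beyond the bench_origin tag are not shown in this review.

```python
import random

def generator_holdout_split(tracks, holdout_frac=0.3, seed=0):
    """Split so that no generator appears in both partitions.

    tracks: list of dicts with a 'generator' key (real sources can be
    treated as pseudo-generators). Field names are hypothetical.
    """
    gens = sorted({t["generator"] for t in tracks})
    rng = random.Random(seed)
    rng.shuffle(gens)
    n_test = max(1, int(len(gens) * holdout_frac))
    test_gens = set(gens[:n_test])           # these generators are fully unseen
    train = [t for t in tracks if t["generator"] not in test_gens]
    test = [t for t in tracks if t["generator"] in test_gens]
    return train, test
```

A track-level split (random tracks held out, generators shared) measures something weaker; reporting both would separate generator-specific leakage from genuine forensic generalization.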
minor comments (2)
- [Abstract] The abstract omits any mention of training procedure, exact data splits, or statistical significance testing for the reported metrics, which would aid assessment even if details appear later in the manuscript.
- [§3.1] Notation for the bounded-mask mechanism in ArtifactUNet could be clarified with an equation or diagram to make the residual extraction process more transparent.
Simulated Author's Rebuttal
Thank you for your thorough and constructive review of our manuscript. We appreciate the referee's focus on methodological clarity and experimental rigor. Below we provide point-by-point responses to the major comments. We have revised the manuscript where the concerns can be directly addressed through clarification or additional analysis.
Point-by-point responses
-
Referee: [§4.2] §4.2 (Zero-shot Evaluation and Benchmark): The unseen partition (n=2,263) is described as fair via bench_origin tags, but the manuscript does not explicitly confirm that entire generators are held out (as opposed to individual tracks). Without this, the F1=0.9829 may reflect leakage of generator-specific traits rather than generalizable forensic physics.
Authors: We thank the referee for identifying this ambiguity. The bench_origin tags ensure that the unseen test partition contains entirely new tracks with no overlap with the training set, enabling track-level zero-shot evaluation across all 22 generators. However, to maintain benchmark scale and diversity, some generators contribute tracks to both partitions. We contend that this does not introduce problematic leakage of generator-specific traits, because ArtifactNet targets codec-level forensic residuals that arise from shared neural audio synthesis physics rather than from model-specific stylistic artifacts. In the revised manuscript we will expand §4.2 to explicitly report the generator overlap statistics between partitions and briefly justify the track-level hold-out given the forensic-physics framing. revision: partial
-
Referee: [§5] §5 (Experimental Results): No control experiment is reported in which real tracks are passed through the same neural codecs (WAV/MP3/AAC/Opus) used by the 22 generators to verify that extracted residuals are caused by AI generation physics rather than shared post-processing or recording chains present in ArtifactBench.
Authors: We agree this control would strengthen the causal claim. In the revised manuscript we will add a control experiment in §5: a subset of real tracks from ArtifactBench will be re-encoded with the identical neural codecs used by the AI generators. We will then run the full ArtifactNet pipeline on these re-encoded real signals and report the resulting FPR to demonstrate that codec post-processing alone does not produce the same residual signatures or classification decisions as actual AI-generated content. revision: yes
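The control the authors commit to reduces to comparing false positive rates on real tracks before and after re-encoding with the generators' codecs. A minimal sketch of that comparison is below (the re-encoding itself, e.g. via an external encoder, is outside the sketch, and the tolerance threshold is illustrative, not from the paper).

```python
def fpr(flags):
    """Fraction of known-real tracks flagged as AI-generated."""
    flags = list(flags)
    return sum(bool(f) for f in flags) / len(flags) if flags else 0.0

def codec_control(baseline_flags, reencoded_flags, tolerance=0.02):
    """Return (fpr_baseline, fpr_reencoded, passed).

    The control passes if re-encoding real audio with the generators'
    codecs does not raise the FPR by more than `tolerance`. A large jump
    would suggest the residuals track codec post-processing rather than
    AI generation itself.
    """
    a, b = fpr(baseline_flags), fpr(reencoded_flags)
    return a, b, (b - a) <= tolerance
```

If the re-encoded FPR stays near the reported 1.49%, the causal claim that residuals come from generation physics is substantially strengthened.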
-
Referee: [§3.2] §3.2 (HPSS Decomposition): The contribution of the HPSS step to the 7-channel forensic features is not ablated, leaving open whether the performance gains derive from the bounded-mask UNet residuals or from the subsequent decomposition.
Authors: We acknowledge the value of isolating the HPSS contribution. The revised manuscript will include an ablation in §3.2 and §5 that compares the full 7-channel HPSS-augmented pipeline against a baseline that feeds the raw 3-channel residuals from ArtifactUNet directly into the classifier (bypassing HPSS). This will quantify the incremental performance benefit attributable to the harmonic-percussive decomposition step. revision: yes
Circularity Check
No significant circularity; empirical results on independent benchmark
Full rationale
The paper's core contribution is an empirical ML pipeline (bounded-mask UNet + HPSS + CNN) trained and evaluated on the newly constructed ArtifactBench dataset with explicit zero-shot partitions. No equations, fitted parameters, or derivations are presented that reduce any claimed prediction or forensic feature to the inputs by construction. Performance numbers (F1=0.9829, FPR=1.49%) are measured outcomes on held-out tracks rather than quantities defined in terms of the model's own parameters or self-referential fits. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner to justify the central forensic-physics claim. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- ArtifactUNet weights
- Classifier CNN weights
axioms (1)
- Domain assumption: neural audio codecs imprint detectable physical artifacts on generated audio, and these artifacts differ systematically from those found in real recordings.
Reference graph
Works this paper leans on
- [1] Deezer and Ipsos, "AI-generated music survey," https://newsroom-deezer.com/2025/11/deezer-ipsos-survey-ai-music/, Nov. 2025.
- [2] A. Batra et al., "Melody or machine: Detecting synthetic music with dual-stream contrastive learning," arXiv preprint arXiv:2512.00621, 2025.
- [3] M. A. Rahman et al., "SONICS: Synthetic or not — identifying counterfeit songs," in Proc. Int. Conf. on Learning Representations (ICLR), 2025.
- [4] D. Afchar, G. Meseguer-Brocal, and R. Hennequin, "Detecting music deepfakes is easy but actually hard," in Proc. IEEE ICASSP, 2025, arXiv:2405.04181.
- [5] D. Afchar, G. Meseguer-Brocal, K. Akesbi, and R. Hennequin, "A Fourier explanation of AI-music artifacts," in Proc. Int. Society for Music Information Retrieval Conf. (ISMIR), 2025, arXiv:2506.19108, Best Paper Award.
- [6] A. Défossez et al., "High fidelity neural audio compression," Trans. Machine Learning Research, 2023.
- [7] R. Kumar et al., "High-fidelity audio compression with improved RVQGAN," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [8] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [9] J. Copet et al., "Simple and controllable music generation," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [10] Y. Li et al., "MERT: Acoustic music understanding model with large-scale self-supervised training," arXiv preprint arXiv:2306.00107, 2023.
- [11] J. Yi, C. Wang, and J. Tao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
- [12] Y. Li et al., "From audio deepfake detection to AI-generated music detection — a pathway and overview," arXiv preprint arXiv:2412.00571, 2024.
- [13] N. Zeghidour et al., "SoundStream: An end-to-end neural audio codec," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 30, pp. 495–507, 2022.
- [14] A. Agostinelli et al., "MusicLM: Generating music from text," arXiv preprint arXiv:2301.11325, 2023.
- [15] P. Dhariwal et al., "Jukebox: A generative model for music," arXiv preprint arXiv:2005.00341, 2020.
- [16] Z. Evans et al., "Stable audio open," arXiv preprint arXiv:2407.14358, 2024.
- [17] H. Liu et al., "AudioLDM: Text-to-audio generation with latent diffusion models," in Proc. Int. Conf. on Machine Learning (ICML), 2023.
- [18] S. Rouard, F. Massa, and A. Défossez, "Hybrid transformers for music source separation," in Proc. IEEE ICASSP, 2023, arXiv:2211.08553.
- [19] A. Défossez et al., "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
- [20] D. Fitzgerald, "Harmonic/percussive separation using median filtering," in Proc. Int. Conf. on Digital Audio Effects (DAFx), 2010.
- [21] Y. Zhang and X. Xu, "Diffusion noise feature: Accurate and fast generated image detection," arXiv preprint arXiv:2312.02625, 2023.
- [22] H. Tak et al., "End-to-end anti-spoofing with RawNet2," in Proc. IEEE ICASSP, 2021.
- [23] J. Jung et al., "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. IEEE ICASSP, 2022.
- [24] X. Liu, X. Wang, M. Sahidullah et al., "ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild," IEEE/ACM Trans. Audio, Speech, Language Process., 2023.
- [25] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
- [26] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," in NIPS Deep Learning Workshop, 2015.
- [27] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, "FMA: A dataset for music analysis," in Proc. Int. Society for Music Information Retrieval Conf. (ISMIR), 2017.
- [28] D. Afchar, G. Meseguer-Brocal, and R. Hennequin, "AI-generated music detection and its challenges," arXiv preprint arXiv:2501.10111, 2025.
- [29] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, "Do ImageNet classifiers generalize to ImageNet?" in Proc. Int. Conf. on Machine Learning (ICML), 2019, pp. 5389–5400.
- [30] R. San Roman et al., "Proactive detection of voice cloning with localized watermarking," in Proc. Int. Conf. on Machine Learning (ICML), 2024.