Probing Token Spaces under Generator Shift in AI-Generated Music Detection

Joonyong Park; Jungwoo Kim; Junyoung Koh; Yuki Saito

arxiv: 2606.08663 · v1 · pith:HPASVBDCnew · submitted 2026-06-07 · 💻 cs.SD · eess.AS

Pith reviewed 2026-06-27 18:02 UTC · model grok-4.3

keywords: AI-generated music detection · generator shift · token spaces · audio codecs · transfer learning · source-restricted evaluation · CoMoE

The pith

Codec-style token spaces vary sharply in performance when AI music detectors face new generators not seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether AI-generated music detectors transfer to new generator sources by building MoM-open, an open reconstruction of the MoM-CLAM benchmark. It introduces a fixed classifier called CoMoE that lets researchers swap different audio token representations without changing anything else about the model or training. Standard test splits turn out to be nearly saturated and hide differences, but when the fake-music sources are restricted so that the test generators are unseen in training, token choice produces large gaps in accuracy. This leads to the suggestion that the type of token space used should be a main focus when building detectors meant to work across changing generators.

Core claim

Using source-restricted evaluation on the MoM-open dataset, the experiments demonstrate that fake-source restriction exposes large differences between token spaces, with X-Codec tokens strongest when training on Udio alone and MERT-derived tokens stronger when training on Suno-v3.5 alone, while standard splits are nearly saturated.

What carries the argument

CoMoE, a compact fixed classifier designed to compare heterogeneous audio token spaces while holding the downstream architecture and training recipe constant.
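
The "fixed downstream, swappable token space" idea can be pictured with a toy harness. The sketch below is not the CoMoE architecture, which this page does not describe: the two tokenizers, the histogram features, the nearest-centroid classifier, and the data are all hypothetical stand-ins, chosen only to show that when the downstream recipe is held constant, any accuracy difference comes from the token space.

```python
from collections import Counter

# Two toy "tokenizers" (hypothetical stand-ins for real audio tokenizers).
# Each maps a list of samples in [-1, 1] to a list of integer token ids.
def coarse_tokens(samples):
    return [0 if s < 0 else 1 for s in samples]

def fine_tokens(samples):
    return [min(3, int((s + 1.0) * 2)) for s in samples]

def histogram(tokens, vocab_size):
    """Normalized token counts: a fixed-length feature for any tokenizer."""
    counts = Counter(tokens)
    return [counts[t] / len(tokens) for t in range(vocab_size)]

def fixed_classifier_accuracy(tokenize, vocab_size, train, test):
    """Same downstream recipe for every tokenizer: nearest class centroid
    over token histograms. Only `tokenize` and `vocab_size` vary."""
    rows_by_label = {}
    for samples, label in train:
        feats = histogram(tokenize(samples), vocab_size)
        rows_by_label.setdefault(label, []).append(feats)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in rows_by_label.items()
    }

    def predict(samples):
        h = histogram(tokenize(samples), vocab_size)
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(h, centroids[c])),
        )

    return sum(predict(s) == y for s, y in test) / len(test)

# Made-up data: "real" clips sit at low amplitudes, "fake" at high ones.
train = [([0.1, 0.2, 0.1, 0.2], "real"), ([0.8, 0.9, 0.8, 0.9], "fake")]
test = [([0.2, 0.1, 0.2, 0.1], "real"), ([0.9, 0.8, 0.9, 0.8], "fake")]

results = {
    name: fixed_classifier_accuracy(tok, vocab, train, test)
    for name, tok, vocab in [("coarse", coarse_tokens, 2), ("fine", fine_tokens, 4)]
}
print(results)
```

In this toy case the coarse tokenizer maps both classes to the same token, so the fixed classifier cannot separate them, while the finer tokenizer can. The real experiments differ in every detail; only the experimental logic carries over.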

If this is right

Standard and real-source-restricted evaluation splits saturate and fail to distinguish token spaces.
Fake-source restriction reveals that X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone.
Codec-style discrete token spaces should be treated as a primary experimental axis when studying generator shift in detection.
With the classifier architecture and training recipe held fixed, the choice of token representation alone accounts for large differences in generalization in this setting.
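
A fake-source-restricted split of the kind described above can be sketched as follows. This is an illustration under assumptions, not the MoM-open protocol itself: the clip ids, the record layout, and the helper name `fake_source_restricted_split` are hypothetical, and only the source names come from the text.

```python
from collections import Counter

# Hypothetical records: (clip_id, label, source). The clip ids are made up;
# the source names are the generators and real corpora named in the text.
CLIPS = [
    ("r1", "real", "FMA"),
    ("r2", "real", "MTG-Jamendo"),
    ("r3", "real", "FMA"),
    ("r4", "real", "MTG-Jamendo"),
    ("f1", "fake", "Udio"),
    ("f2", "fake", "Udio"),
    ("f3", "fake", "Suno-v3.5"),
    ("f4", "fake", "Suno-v3.5"),
]

def fake_source_restricted_split(clips, train_fake_source, real_test_ids):
    """Train on one fake generator only; test on every other fake generator.

    Real clips are divided by id so that no real clip lands on both sides.
    """
    train, test = [], []
    for clip in clips:
        clip_id, label, source = clip
        if label == "fake":
            (train if source == train_fake_source else test).append(clip)
        else:
            (test if clip_id in real_test_ids else train).append(clip)
    return train, test

train, test = fake_source_restricted_split(CLIPS, "Udio", {"r3", "r4"})
print("train sources:", Counter(src for _, _, src in train))
print("test sources: ", Counter(src for _, _, src in test))
```

Swapping `"Udio"` for `"Suno-v3.5"` gives the mirror-image condition. A standard split would instead mix every generator into both sides, which is the setting the page describes as nearly saturated.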

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future detectors may need to ensemble multiple token spaces to cover different generator shifts.
Similar token-space sensitivity could appear in other generative media like images or video when facing new generators.
Open reconstructions like MoM-open enable community testing of transfer without relying on restricted data.

Load-bearing premise

The open reconstruction MoM-open using FMA and MTG-Jamendo keeps the fake-generator protocol close enough to the original to separate token effects from data differences.

What would settle it

A direct comparison showing that the performance gaps between token spaces disappear when the same models are evaluated on the original non-open MoM-CLAM splits would falsify the isolation of token-space effects.

Figures

Figures reproduced from arXiv: 2606.08663 by Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito.

**Figure 1.** Architecture of CoMoE.

Original abstract

AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper isolates token-space effects in music detection via a new open dataset and fixed classifier, but the real-corpus swap lacks checks that could confound the main claim.

The main contribution is a controlled comparison of audio token spaces under generator shift. They release MoM-open, an open reconstruction of an earlier benchmark that swaps in FMA and MTG-Jamendo real audio while keeping the fake generators fixed, and CoMoE, a compact fixed classifier that lets them swap token extractors without changing the rest of the pipeline. On fake-source-restricted splits the results show X-Codec tokens performing best when the model trains on Udio fakes alone and MERT-derived tokens performing best on Suno-v3.5 fakes alone. That specific pattern is new relative to the cited prior work.

They earn credit for making the data and code public and for holding the downstream architecture constant so the token comparison is direct. The observation that standard splits saturate while restricted ones expose large gaps is useful for anyone thinking about deployment.

The soft spot is the reconstruction step. The claim that token spaces are the primary axis rests on the assumption that the new real corpus does not systematically alter how the tokenizers behave. The abstract gives no acoustic-feature overlap, genre balance, or length statistics to support that the swap leaves token-extractor behavior unchanged. If the replacement real audio differs in spectral or timbral ways that interact with codec versus MERT tokenization, the performance differences could be artifacts rather than token effects. No error bars or statistical tests are reported either, so the size and reliability of the gaps are hard to judge.

This is for researchers working on robust AI-music detectors who need to handle unseen generators. The new dataset and comparison method are concrete enough that a referee should examine the validation of the reconstruction and request the missing controls. I would send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard and real-source-restricted splits for AI-generated music detection are nearly saturated, but fake-source restriction on the introduced MoM-open dataset (an open reconstruction of MoM-CLAM using FMA/MTG-Jamendo) reveals large performance gaps between token spaces when using the fixed CoMoE classifier: X-Codec tokens perform best under Udio-only training while MERT-derived tokens are stronger under Suno-v3.5-only training. This leads to the suggestion that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift.

Significance. If the isolation of token-space effects from corpus artifacts holds, the results would identify representation choice as a key factor in robustness to unseen generators, with implications for detector design in music audio tasks. The public availability of code and data strengthens the work's reproducibility.

major comments (2)

[Abstract] Abstract and §3 (MoM-open construction): the central claim that observed gaps are attributable to token-space properties rather than dataset artifacts requires that the FMA/MTG-Jamendo replacement preserves the original real-corpus distribution with respect to token-extractor behavior, yet no validation (acoustic feature overlap, genre/length statistics, or spectral bias checks) is reported to support this isolation.
[Experiments] Experiments section: performance differences are reported as 'large' without error bars, statistical significance tests, multiple runs, or full training details (learning rate, epochs, batch size), which is load-bearing for the claim that X-Codec is strongest on Udio-only and MERT on Suno-v3.5-only.

minor comments (1)

The description of CoMoE as a 'compact fixed classifier' for comparing heterogeneous token spaces would benefit from an explicit architecture diagram or parameter count to clarify how the downstream recipe remains unchanged across extractors.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our work. We address the major comments point by point below, indicating planned revisions where applicable.

Referee: [Abstract] Abstract and §3 (MoM-open construction): the central claim that observed gaps are attributable to token-space properties rather than dataset artifacts requires that the FMA/MTG-Jamendo replacement preserves the original real-corpus distribution with respect to token-extractor behavior, yet no validation (acoustic feature overlap, genre/length statistics, or spectral bias checks) is reported to support this isolation.

Authors: We agree that validating the distribution preservation is important for isolating token-space effects. Since the original MoM-CLAM real corpus is non-redistributable, direct acoustic feature overlap checks with it are not feasible. However, we will add genre distribution, track length statistics, and basic spectral feature comparisons (e.g., mean spectral centroid and rolloff) for the FMA/MTG-Jamendo real sources in the revised §3 to provide transparency on the replacement corpus. The core findings focus on performance gaps under fake-source restriction, where the real data acts as a fixed negative class.
Revision: partial
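
For readers unfamiliar with the two spectral statistics named in this response, a minimal sketch follows. It assumes the common magnitude-weighted definitions (some toolkits use power instead of magnitude for rolloff), uses a naive DFT that is only practical for short frames, and runs on a synthetic tone rather than any audio from the paper; the function names are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz."""
    total = sum(mags)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Lowest bin frequency below which `fraction` of total magnitude lies."""
    target = fraction * sum(mags)
    running = 0.0
    for k, m in enumerate(mags):
        running += m
        if running >= target:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n

# Synthetic 1 kHz tone; at 8 kHz and 64 samples it falls exactly on bin 8.
sr, n = 8000, 64
tone = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(n)]
mags = magnitude_spectrum(tone)
centroid = spectral_centroid(mags, sr, n)
rolloff = spectral_rolloff(mags, sr, n)
print(round(centroid), rolloff)
```

Averaging these per-frame values over each corpus would give the kind of summary comparison the authors propose. Such summaries describe the replacement corpus but cannot by themselves show equivalence with the original real data.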
Referee: [Experiments] Experiments section: performance differences are reported as 'large' without error bars, statistical significance tests, multiple runs, or full training details (learning rate, epochs, batch size), which is load-bearing for the claim that X-Codec is strongest on Udio-only and MERT on Suno-v3.5-only.

Authors: This is a valid concern regarding the robustness of the reported differences. In the revised manuscript, we will rerun the experiments with multiple random seeds (at least 3-5), include error bars on the performance metrics, conduct statistical significance tests (e.g., paired t-tests) between token spaces, and provide complete training details including learning rate, number of epochs, batch size, and other hyperparameters in the Experiments section or a dedicated appendix.
Revision: yes
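
The multi-seed reporting promised here could look roughly like the sketch below. The per-seed accuracies are made-up placeholders, not results from the paper, and the names `scores_a` and `scores_b` are hypothetical labels for two token spaces evaluated on the same seeds.

```python
import math
import statistics

# Placeholder per-seed accuracies for two token spaces, same seeds in the
# same order. These numbers are invented for illustration only.
scores_a = [0.91, 0.89, 0.93, 0.90, 0.92]
scores_b = [0.84, 0.85, 0.88, 0.83, 0.85]

def mean_and_std(scores):
    """Mean and sample standard deviation, i.e. the usual error bar."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(a, b):
    """t = mean(d) / (stdev(d) / sqrt(n)) over per-seed differences d."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

mean_a, std_a = mean_and_std(scores_a)
mean_b, std_b = mean_and_std(scores_b)
t_stat = paired_t_statistic(scores_a, scores_b)

print(f"A: {mean_a:.3f} +/- {std_a:.3f}")
print(f"B: {mean_b:.3f} +/- {std_b:.3f}")
print(f"paired t = {t_stat:.2f} with {len(scores_a) - 1} degrees of freedom")
```

The statistic is compared against a t-distribution with n - 1 degrees of freedom. With only three to five seeds the test has little power, so the error bars are arguably the more informative part of the report.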

standing simulated objections not resolved

Direct validation of token-extractor behavior equivalence between the original non-redistributable real corpus and the FMA/MTG-Jamendo replacement, due to lack of access to the original data.

Circularity Check

0 steps flagged

Empirical study with independent dataset and fixed classifier; no circular derivation

The paper conducts an empirical comparison of audio token spaces using a newly constructed open dataset (MoM-open) and a fixed downstream classifier (CoMoE) whose architecture and training recipe are held constant. Performance differences under fake-source restriction are reported directly from cross-generator experiments rather than derived from any fitted parameter, self-citation chain, or definitional equivalence. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claim to its inputs by construction. The recommendation to treat codec-style spaces as a primary axis follows from the observed experimental gaps and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on empirical differences observed with the new dataset and fixed classifier; relies on standard supervised classification assumptions and the fidelity of the dataset reconstruction.

axioms (1)

Domain assumption: Standard supervised learning assumptions for audio classification hold for the CoMoE setup.
Invoked implicitly when training the fixed classifier on token representations.

invented entities (1)

CoMoE (no independent evidence)
Purpose: Compact fixed classifier enabling isolated comparison of heterogeneous audio token spaces.
Newly introduced in the paper as the experimental tool.
