arxiv: 2605.15831 · v1 · pith:EIQJJ4XWnew · submitted 2026-05-15 · 💻 cs.SD · cs.AI

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

Pith reviewed 2026-05-19 18:42 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords music generation · audio tokenizer · autoregressive modeling · mel-spectrogram · 2D tokenization · single codebook · time-frequency representation

The pith

BandTok turns music into a 2D time-frequency token grid from a single shared codebook, reducing sequential dependencies for autoregressive generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BandTok as a tokenizer that represents each Mel-spectrogram frame with tokens from one shared codebook, one per frequency band. This produces a grid of tokens across time and frequency that carries less imposed order than the stacked residuals of existing audio codecs. Because the tokens are more independent after flattening, autoregressive language models suffer less from error buildup during music generation. The design keeps reconstruction quality high by adding a multi-scale PatchGAN loss and EMA codebook updates, and it pairs the grid with 2D Rotary Position Embeddings so the model respects both time and frequency axes.
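The band-wise idea can be sketched in a few lines. This is a minimal illustration, assuming each frame is split into equal-width bands and each band vector is matched to its nearest entry in one shared codebook. The function name `band_tokenize`, the toy sizes, and the choice to quantize raw Mel values directly (rather than learned encoder features) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def band_tokenize(mel, codebook, n_bands):
    """Map a (frames, n_mels) spectrogram to a (frames, n_bands) token grid.

    Every band of every frame is matched to its nearest entry in the same
    shared codebook of shape (codebook_size, n_mels // n_bands).
    """
    n_frames, n_mels = mel.shape
    band_width = n_mels // n_bands
    assert band_width * n_bands == n_mels, "n_mels must divide evenly into bands"
    assert codebook.shape[1] == band_width
    # (frames, bands, band_width): one vector per time-frequency cell.
    bands = mel.reshape(n_frames, n_bands, band_width)
    # Squared Euclidean distance from each band vector to each codebook entry.
    dists = ((bands[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)  # (frames, bands) integer token ids

rng = np.random.default_rng(0)
mel = rng.normal(size=(6, 16))        # toy input: 6 frames, 16 Mel bins
codebook = rng.normal(size=(32, 4))   # toy shared codebook: 32 entries of width 4
grid = band_tokenize(mel, codebook, n_bands=4)
# Flatten frame by frame (time-major) for an autoregressive model.
sequence = grid.reshape(-1)
```

Here `grid[t, b]` is the token for band `b` of frame `t`, and the flattened sequence visits all bands of one frame before moving to the next frame. Other flattening orders are possible; time-major is assumed only for this sketch.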

Core claim

BandTok is a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling than residual-codebook tokenizers. Reconstruction quality is maintained through a multi-scale PatchGAN objective and EMA codebook updates, while an autoregressive language model with 2D Rotary Position Embedding preserves the temporal and frequency-band structure.
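For readers unfamiliar with EMA codebook updates, the sketch below shows one step in the style popularized by VQ-VAE: each codebook entry drifts toward the running mean of the vectors assigned to it. The function name, the decay value, and the Laplace smoothing term are assumptions for illustration; the paper's exact update rule and hyperparameters are not stated in this summary.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, vectors, codes,
                        decay=0.99, eps=1e-5):
    """One EMA step in the style of VQ-VAE; returns updated copies.

    codebook:     (K, D) current entries
    cluster_size: (K,)   running count of vectors assigned to each entry
    embed_sum:    (K, D) running sum of vectors assigned to each entry
    vectors:      (N, D) vectors quantized in this batch
    codes:        (N,)   index of the entry each vector was assigned to
    """
    K = codebook.shape[0]
    one_hot = np.eye(K)[codes]                      # (N, K) assignment matrix
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(0)
    embed_sum = decay * embed_sum + (1 - decay) * (one_hot.T @ vectors)
    # Laplace smoothing keeps rarely used entries from dividing by ~0.
    total = cluster_size.sum()
    smoothed = (cluster_size + eps) / (total + K * eps) * total
    new_codebook = embed_sum / smoothed[:, None]
    return new_codebook, cluster_size, embed_sum

# Toy usage (hypothetical values): two entries, three batch vectors.
codebook = np.array([[0.0, 0.0], [5.0, 5.0]])
vectors = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
codes = np.array([0, 0, 1])
new_codebook, cluster_size, embed_sum = ema_codebook_update(
    codebook, np.ones(2), codebook.copy(), vectors, codes, decay=0.9
)
```

In the toy call, entry 0 moves slightly toward the mean of its two assigned vectors and entry 1 moves toward the single vector assigned to it, with no gradient step on the codebook itself.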

What carries the argument

BandTok, a 2D Mel-spectrogram tokenizer that draws each frequency-band token from one shared codebook to form an independent time-frequency grid.

If this is right

  • BandTok produces higher-fidelity reconstructions than residual-codebook baselines via multi-scale PatchGAN and EMA updates.
  • The single-codebook 2D grid reduces error accumulation during autoregressive decoding compared with hierarchical residuals.
  • 2D Rotary Position Embeddings allow the language model to respect both time order and frequency-band relations.
  • The approach yields competitive music generation results even when training data are limited.
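The 2D Rotary Position Embedding point can be made concrete with a small sketch. It assumes a common construction in which half of each query/key vector is rotated by the frame index and the other half by the band index; whether the paper splits dimensions this way is not stated in this summary, and the helper names below are hypothetical.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out

def rope_2d(vec, frame, band):
    """First half of the vector encodes the frame index, second half the band."""
    half = len(vec) // 2
    assert half % 2 == 0, "each half needs an even number of dimensions"
    return rope_1d(vec[:half], frame) + rope_1d(vec[half:], band)

def grid_position(flat_index, n_bands):
    """Recover (frame, band) for a token in a time-major flattened sequence."""
    return divmod(flat_index, n_bands)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.1, -0.3, 0.7, 1.5]
# Same (frame, band) offset of (2, 3) at two different absolute positions.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 8), rope_2d(k, 8, 5))
```

Because each pair is rotated by an angle proportional to its position, `score_a` and `score_b` agree up to floating-point error: the attention score depends only on the relative frame and band offsets, which is how the flattened sequence can retain its grid structure.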

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The time-frequency grid view could be tested on speech or environmental audio to check whether the independence benefit transfers beyond music.
  • Longer generated sequences may show clearer advantages for BandTok because residual dependencies compound over time.
  • The image-like treatment of audio may allow vision-model techniques such as patch-based attention to be borrowed directly, without new architecture work.

Load-bearing premise

Flattening residual multi-codebook sequences imposes harmful sequential dependencies that a single shared codebook avoids without losing reconstruction quality.

What would settle it

Train identical autoregressive models on the same music data once with BandTok tokens and once with residual-codebook tokens, then compare generation quality metrics and listening-test scores for error accumulation.

Figures

Figures reproduced from arXiv: 2605.15831 by Guochen Yu, Xiaotao Gu, Xingyu Ma, Yuqing Cheng.

Figure 1: Comparison between residual and band-wise tokens.
Figure 2: Comparison between RVQ tokenizers and BandTok.
Figure 3: Illustration of 2D RoPE for flattened audio tokens.
Original abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each audio frame using Mel-frequency band tokens drawn from a single shared codebook. This produces a physically interpretable time-frequency token grid intended to exhibit more independent token structure than residual multi-codebook quantizers, thereby reducing sequential dependencies and error accumulation when flattened for autoregressive modeling. The tokenizer is trained with a multi-scale PatchGAN objective and EMA codebook updates to improve reconstruction fidelity. The paper further proposes an autoregressive language model equipped with 2D Rotary Position Embedding (2D RoPE) to preserve both temporal and frequency-band structure. Experiments are reported to show gains over residual-codebook baselines in reconstruction and generation quality, particularly under data-limited conditions.

Significance. If the empirical gains are robust and the independence claim is substantiated, BandTok could offer a useful alternative to standard audio codecs for autoregressive music generation by aligning tokenization more closely with the physical time-frequency structure of audio. The public release of source code and generation demos supports reproducibility and is a clear strength.

major comments (1)
  1. [Experiments] The central claim that the single shared codebook produces measurably more independent tokens and lowers error propagation rests on an untested assumption. The manuscript reports overall reconstruction and generation metrics but provides no token-level statistics (conditional entropy, mutual information between successive tokens, or per-step reconstruction degradation curves) that would isolate the effect of the shared codebook from contributions of PatchGAN training, EMA updates, or 2D RoPE.
minor comments (2)
  1. [Abstract] The abstract asserts quantitative improvements without reporting specific metrics, error bars, or dataset details; these should be summarized with numbers in the abstract or a dedicated results table.
  2. [Method] Clarify the precise sequence length and flattening procedure for the 2D band-token grid versus residual-codebook sequences to allow direct comparison of dependency structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment below and outline revisions to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Experiments] The central claim that the single shared codebook produces measurably more independent tokens and lowers error propagation rests on an untested assumption. The manuscript reports overall reconstruction and generation metrics but provides no token-level statistics (conditional entropy, mutual information between successive tokens, or per-step reconstruction degradation curves) that would isolate the effect of the shared codebook from contributions of PatchGAN training, EMA updates, or 2D RoPE.

    Authors: We agree that direct token-level analyses would more rigorously substantiate the independence claim and isolate the contribution of the single shared codebook. In the revised manuscript we will add conditional entropy and mutual information measurements between successive tokens, comparing BandTok against residual multi-codebook baselines. We will also report per-step reconstruction degradation curves under autoregressive rollout to quantify error accumulation. These additions will help separate the tokenizer design from the effects of PatchGAN training, EMA updates, and 2D RoPE. The observed gains in reconstruction fidelity and generation quality, especially under data-limited conditions, provide complementary evidence that the flattened token sequence is more amenable to autoregressive modeling.

    revision: yes
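The token-level statistics discussed in this exchange can be estimated from simple counts. The sketch below is a hypothetical plug-in estimator for the conditional entropy and mutual information between adjacent tokens in a flattened sequence; it is not code from the paper, and the function name is an assumption.

```python
import math
from collections import Counter

def successive_token_stats(tokens):
    """Plug-in estimates (in bits) for adjacent tokens x_t -> x_{t+1}.

    Returns (conditional_entropy, mutual_information), where
    conditional_entropy = H(next | prev) and
    mutual_information  = H(next) - H(next | prev).
    """
    pairs = list(zip(tokens[:-1], tokens[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    prev_counts = Counter(p for p, _ in pairs)
    next_counts = Counter(nx for _, nx in pairs)

    h_next = -sum(c / n * math.log2(c / n) for c in next_counts.values())
    h_cond = -sum(
        c / n * math.log2(c / prev_counts[p]) for (p, _), c in pair_counts.items()
    )
    return h_cond, h_next - h_cond

# A strictly alternating sequence: the next token is fully determined.
dependent = [0, 1] * 50
# A sequence where the previous token says almost nothing about the next.
mixed = [0, 0, 1, 1] * 25
h_dep, mi_dep = successive_token_stats(dependent)
h_mix, mi_mix = successive_token_stats(mixed)
```

Lower mutual information between neighbours would be consistent with the "more independent tokens" claim. Plug-in estimates are biased when the vocabulary is large relative to the sample, so a real comparison would need long sequences and the same estimator applied to both tokenizers.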

Circularity Check

0 steps flagged

No derivation reduces to a fitted parameter or a self-citation by construction; the claims rest on an explicit design choice and reported metrics.

full rationale

The paper proposes BandTok as a 2D single-codebook tokenizer and asserts that this yields a more independent token structure better suited for autoregressive modeling. This is presented as a design rationale rather than a derived result. No equation or step equates a prediction to its own fitted input, and no load-bearing claim relies on a self-citation chain that itself reduces to the target result. Empirical comparisons of reconstruction and generation quality are reported separately from the architectural choice, leaving the central assumption testable rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the design assumption that single-codebook 2D tokens are more independent than residual hierarchies, plus standard neural training choices.

free parameters (2)
  • codebook size
    Size of the shared codebook is a hyperparameter chosen or tuned for the tokenizer.
  • PatchGAN scales
    Number and configuration of scales in the multi-scale PatchGAN objective.
axioms (1)
  • domain assumption: Mel-spectrogram representation preserves perceptually relevant audio structure
    Invoked when choosing Mel-frequency bands as the basis for tokenization.
invented entities (1)
  • BandTok 2D tokenizer (no independent evidence)
    purpose: Represent music as independent time-frequency band tokens
    New architecture introduced to address residual dependency issues.

pith-pipeline@v0.9.0 · 5705 in / 1309 out tokens · 37445 ms · 2026-05-19T18:42:43.376094+00:00 · methodology

