{"paper":{"title":"Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"BandTok turns music into a 2D time-frequency token grid from a single shared codebook, reducing sequential dependencies for autoregressive generation.","cross_cats":["cs.AI"],"primary_cat":"cs.SD","authors_text":"Guochen Yu, Xiaotao Gu, Xingyu Ma, Yuqing Cheng","submitted_at":"2026-05-15T10:35:49Z","abstract_excerpt":"Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more inde"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"BandTok yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling than residual-codebook tokenizers.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The residual hierarchy in existing high-fidelity codecs imposes strong sequential dependencies that amplify error accumulation during autoregressive generation after sequence flattening; the single shared codebook in BandTok avoids this while preserving reconstruction quality.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"BandTok turns music into a 2D time-frequency token grid from a single shared codebook, reducing sequential dependencies for autoregressive generation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5da5a4f6b79c9ba79d73cdff9a2787e65b0287d19813e292b3b28f4583992545"},"source":{"id":"2605.15831","kind":"arxiv","version":1},"verdict":{"id":"55b822d9-f159-452d-bb0a-5bedd8f9de6d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T18:42:43.376094Z","strongest_claim":"BandTok yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling than residual-codebook tokenizers.","one_line_summary":"BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The residual hierarchy in existing high-fidelity codecs imposes strong sequential dependencies that amplify error accumulation during autoregressive generation after sequence flattening; the single shared codebook in BandTok avoids this while preserving reconstruction quality.","pith_extraction_headline":"BandTok turns music into a 2D time-frequency token grid from a single shared codebook, reducing sequential dependencies for autoregressive generation."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15831/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T19:01:19.007650Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T18:52:03.665460Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T17:33:48.719292Z","status":"skipped","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T17:21:55.858488Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"ea0e2a49401d112f15464e7410639a4886e168baf4e0a64a51ea9640f4b337db"},"references":{"count":33,"sample":[{"doi":"","year":2021,"title":"Soundstream: An end-to-end neural audio codec,","work_id":"9fbb792b-e036-44f0-b2e7-5f5b614d0f9d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"High Fidelity Neural Audio Compression","work_id":"bc645d2d-e9f2-4cb8-9a6d-bd557bc7a258","ref_index":2,"cited_arxiv_id":"2210.13438","is_internal_anchor":true},{"doi":"","year":2023,"title":"High-fidelity audio compression with improved rvqgan,","work_id":"19f9e2a7-acb1-4a5b-b00a-04e3d9f80a41","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Audiolm: a language modeling approach to audio generation,","work_id":"bd60205e-fea4-4469-841c-44cf3c04ac71","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"MusicLM: Generating Music From Text","work_id":"15e6566e-1c36-468f-966e-823248cbf87f","ref_index":5,"cited_arxiv_id":"2301.11325","is_internal_anchor":true}],"resolved_work":33,"snapshot_sha256":"dc66c0399d6ca420ef5144fce57dfb62ab236176c887334318f4249a1b90521f","internal_anchors":6},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}