BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Haoyu Gu; Jingwei Zhao; Lekai Qian; Ziyu Wang

arxiv: 2604.19532 · v3 · pith:B7KWVUBZnew · submitted 2026-04-21 · 💻 cs.SD · cs.AI

Pith reviewed 2026-05-10 01:14 UTC · model grok-4.3

classification: cs.SD · cs.AI

keywords: symbolic music tokenization, beat-based encoding, music generation, transformer models, piano roll representation, long-range dependencies, structural coherence

The pith

Uniform beat steps in music tokenization improve generation quality and long-range pattern capture compared to event-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that symbolic music can be tokenized more effectively by dividing time into uniform steps such as beats, rather than using sequences of musical events with variable durations. Within each step, events sharing the same pitch are collapsed into one token, and the sequence is organized explicitly by these time steps to mimic a sparse piano-roll grid. When applied to Transformer models for continuing music or generating accompaniments, this method produces outputs with greater musical quality and structural coherence. Analyses also indicate gains in efficiency and an improved ability to model patterns across extended time spans. A reader would care because current music language models often falter on rhythmic consistency and long-term form, and a regular temporal grid might address these issues directly.

Core claim

The core claim is that a uniform temporal step tokenization, where music is segmented into fixed-length units and same-pitch notes per unit are merged, enables better performance on symbolic music generation tasks than traditional event-based encodings that allow non-uniform time progression.

What carries the argument

The BEAT tokenization, which uses uniform-length musical steps as the basic unit, encodes all same-pitch events in a step as a single token, and groups tokens by time step.
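
As a rough illustration of this idea (a sketch under stated assumptions, not the authors' implementation), the code below groups note onsets by beat and merges all onsets of the same pitch within a beat into one token. The note layout `(onset_in_beats, pitch)`, the 4-slot sub-beat grid, and the `(pitch, pattern)` token shape are all hypothetical choices made for the example; only the `Rest` token for empty beats is named in the paper excerpt quoted with Figure 1.

```python
from collections import defaultdict

SLOTS_PER_BEAT = 4  # assumed sub-beat resolution (hypothetical, not from the paper)

def tokenize_by_beat(notes, num_beats):
    """Group note onsets by beat and merge same-pitch onsets into one token.

    notes: list of (onset_in_beats, pitch) pairs, onset given as a float.
    Returns one entry per beat: a list of (pitch, pattern) tokens sorted by
    pitch, or ["Rest"] when the beat contains no onsets.
    """
    grid = [defaultdict(lambda: [0] * SLOTS_PER_BEAT) for _ in range(num_beats)]
    for onset, pitch in notes:
        slot_index = int(round(onset * SLOTS_PER_BEAT))
        beat, slot = divmod(slot_index, SLOTS_PER_BEAT)
        if 0 <= beat < num_beats:
            grid[beat][pitch][slot] = 1  # mark an onset in this sub-beat slot
    sequence = []
    for beat_notes in grid:
        if not beat_notes:
            sequence.append(["Rest"])
        else:
            sequence.append([(p, tuple(beat_notes[p])) for p in sorted(beat_notes)])
    return sequence

# Two onsets of pitch 60 in beat 0 collapse into a single token.
notes = [(0.0, 60), (0.5, 60), (0.0, 64), (2.25, 67)]
tokens = tokenize_by_beat(notes, num_beats=3)
```

Here the sequence advances by exactly one beat per entry, whatever the note density, which is the uniform time progression the paper contrasts with event-based encodings.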

If this is right

Generated music exhibits improved quality and structural coherence on continuation and accompaniment tasks.
The approach captures long-range patterns more effectively than event-based tokenizations.
Token sequences are shorter and more efficient to process.
Models benefit from explicit uniform time progression rather than implicit handling through variable durations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This tokenization could extend to other time-based arts like dance or speech synthesis where regular timing aids coherence.
Future models might combine this with event-based elements for hybrid precision in timing.
Evaluation on datasets with complex rhythms could test if uniform steps introduce artifacts in non-grid music.

Load-bearing premise

Discretizing music into uniform beat steps and merging same-pitch events within steps preserves all musically relevant information without significant loss or ambiguity.
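
One way to probe this premise is to measure how far onsets move when snapped to a fixed grid. The sketch below is illustrative only: the grid resolutions (4 and 12 slots per beat) and the example rhythms are assumptions, and the paper's actual resolution is not stated in the abstract.

```python
def quantize_onsets(onsets, slots_per_beat):
    """Snap onsets (in beats) to the nearest grid slot."""
    return [round(t * slots_per_beat) / slots_per_beat for t in onsets]

def max_quantization_error(onsets, slots_per_beat):
    """Largest absolute displacement, in beats, caused by snapping to the grid."""
    snapped = quantize_onsets(onsets, slots_per_beat)
    return max(abs(a - b) for a, b in zip(onsets, snapped))

# Straight eighth notes fall on a 4-slot grid; triplet-swing eighths do not.
straight = [0.0, 0.5, 1.0, 1.5]
swung = [0.0, 2 / 3, 1.0, 1 + 2 / 3]

err_straight = max_quantization_error(straight, 4)   # no displacement
err_swung_4 = max_quantization_error(swung, 4)       # about 1/12 of a beat
err_swung_12 = max_quantization_error(swung, 12)     # negligible on a triplet-aware grid
```

The same rhythm is lossless on one grid and displaced on another, so whether the premise holds depends on how the chosen resolution relates to the material.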

What would settle it

If listener studies or objective metrics show that event-based models produce more coherent and higher-quality music than beat-based ones when both are trained and evaluated identically, the advantage would be disproven.

Figures

Figures reproduced from arXiv: 2604.19532 by Haoyu Gu, Jingwei Zhao, Lekai Qian, Ziyu Wang.

**Figure 1.** Overview of the BEAT encoding framework.

… token with its corresponding pattern and velocity tokens, representing the beat as:

$$u = (d_1, s_{p_1}, v_{p_1}) \oplus (d_2, s_{p_2}, v_{p_2}) \oplus \cdots \oplus (d_M, s_{p_M}, v_{p_M}). \tag{3}$$

For empty beats ($M = 0$), we set $u = \text{Rest}$, a special token. Step 3: Sequence construction. To assemble beat-level sequences $u$ into a complete multi-track sequence, we introduce three additional token categories. We use…
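
The beat representation in Equation (3) can be sketched as a flattening of per-entry triples. This is a minimal illustration, assuming the ⊕ operator denotes sequence concatenation; the excerpt does not define `d`, so all three components are treated as opaque placeholder strings, and the function name `encode_beat` is hypothetical.

```python
REST = "Rest"  # special token for an empty beat, as named in the excerpt

def encode_beat(triples):
    """Flatten a beat's (d, s, v) triples into one token list.

    triples: list of (d, s, v) tuples, one per entry in the beat.
    An empty list (M = 0) yields the single Rest token.
    """
    if not triples:
        return [REST]
    tokens = []
    for d, s, v in triples:
        # List concatenation stands in for the ⊕ operator (an assumption).
        tokens.extend([d, s, v])
    return tokens

beat = encode_beat([("d1", "s1", "v1"), ("d2", "s2", "v2")])
empty = encode_beat([])
```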

**Figure 2.** Subjective evaluation results. Bar plots report mean ratings and standard errors. * indicates a statistically significant difference (p < 0.05) based on pairwise t-tests with Holm-Bonferroni correction; “ns” denotes non-significant differences.

… coherence and distributional similarity to real music. While some baselines achieve competitive JSSC, they exhibit substantially worse JSGC, indicating irregular r…
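
For readers unfamiliar with the Holm-Bonferroni correction named in the Figure 2 caption, the step-down procedure can be sketched as follows. The p-values are made-up numbers for illustration, not results from the paper, and the function name is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is retained as well
    return reject

p_values = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
decisions = holm_bonferroni(p_values)
```

With these inputs the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 fails against 0.025, which also stops 0.04 from being tested.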

**Figure 3.** Unique beat growth curves for music continuation. BEAT (red) closely tracks ground truth (blue), while Compound Word (green) shows excessive diversity and Interleaved ABC (purple) shows excessive repetition.

… reduces training burden. Section 5.2 probes the ability to capture locality patterns across pitch and time, which may enhance plausibility in generation. In Section 5.3, we study an additional real-tim…
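
A unique beat growth curve of the kind plotted in Figure 3 can be sketched as a running count of distinct beats. The paper's exact definition of beat identity is not given in this excerpt, so the example assumes beats are hashable values compared by exact equality.

```python
def unique_beat_growth(beats):
    """Return the cumulative number of distinct beats after each position."""
    seen = set()
    curve = []
    for beat in beats:
        seen.add(beat)
        curve.append(len(seen))
    return curve

# A highly repetitive sequence flattens early; a fully novel one grows linearly.
repetitive = unique_beat_growth(["A", "B", "A", "B", "A", "B", "A", "B"])
varied = unique_beat_growth(["A", "B", "C", "D", "E", "F", "G", "H"])
```

A curve that plateaus too soon suggests excessive repetition, and one that keeps climbing suggests excessive diversity, matching the two failure modes the caption describes.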

**Figure 4.** BPE compression rate across the number of BPE merges. Lower values indicate stronger regularity.

… Transposition, where each bar is a one-semitone transposition of the previous bar, probing pitch-invariant locality; (2) Beat Interleaving, where each bar follows a fixed rhythmic pattern (AAAA, ABAB, or Mixed), evaluating beat-level structural regularities; (3) Time-Shift Reconstruction, where the first 4 ba…
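
The quantity in Figure 4 can be approximated with a toy byte-pair-encoding loop. This sketch assumes the compression rate is the sequence length after merging divided by the original length, which is consistent with "lower is more regular" but is not confirmed by the excerpt; the greedy merge rule and function name are likewise illustrative.

```python
from collections import Counter

def bpe_compression_curve(tokens, num_merges):
    """Apply greedy BPE merges; return len(seq) / len(original) after each merge."""
    seq = list(tokens)
    original_len = len(seq)
    curve = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges cannot shorten the sequence
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append((a, b))  # the merged pair becomes one new symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        curve.append(len(seq) / original_len)
    return curve

regular = bpe_compression_curve(["A", "B"] * 4, num_merges=3)
irregular = bpe_compression_curve(["A", "B", "C", "D"], num_merges=3)
```

The regular sequence compresses to a quarter of its length in two merges, while the irregular one offers no repeated pair to merge.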

**Figure 5.** Subjective evaluation results for real-time accompaniment generation. Bar plots report mean ratings and standard errors. * indicates a statistically significant difference (p < 0.05).

We fine-tune our piano continuation model on the accompaniment generation task and compare it against SongDriver (Wang et al., 2022), a two-stage system specifically designed for real-time accompaniment. It first generates…

**Figure 6.** Stepwise transposition pattern: each bar is transposed up by one semitone from the previous bar.

I.2. Beat Interleaving
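
The stepwise transposition pattern of Figure 6 is simple to generate synthetically. The sketch below assumes a bar is represented as a list of MIDI pitch numbers, which is a simplification for illustration rather than the paper's data format.

```python
def stepwise_transposition(first_bar, num_bars):
    """Build bars where each bar is the previous one shifted up one semitone.

    first_bar: list of MIDI pitch numbers making up the first bar.
    """
    return [[pitch + k for pitch in first_bar] for k in range(num_bars)]

# A C major triad rising by one semitone per bar.
bars = stepwise_transposition([60, 64, 67], num_bars=4)
```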

**Figure 7.** ABAB pattern: beats alternate in an A-B-A-B pattern within each bar.

Dataset. We construct synthetic 8-bar sequences with controlled rhythmic structures. Each bar contains 4 beats following a specific pattern: • AAAA (…

**Figure 8.** AAAA pattern: four identical beats per bar.

I.3. Time-Shift Reconstruction

Dataset. Each sequence contains a 4-bar prompt followed by the same content delayed by k ∈ {0, 1, 2, 3} beats. This pattern probes time-invariant locality.

Evaluation. We use a held-out set of 400 samples per shift amount for final evaluation. Given the 4-bar prompt, models generate the time-shifted continuation using deterministic …
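
The time-shift reconstruction setup described above can be sketched as a prompt and target pair. The excerpt does not say how the delay is filled or whether the target is truncated to a fixed length, so this example assumes the k delayed beats are rests and that nothing is truncated; beats are placeholder strings.

```python
def time_shift_example(prompt_beats, k, rest="Rest"):
    """Return (prompt, target): the target repeats the prompt after k rest beats."""
    if k < 0:
        raise ValueError("k must be non-negative")
    prompt = list(prompt_beats)
    target = [rest] * k + prompt  # delay filled with rests (an assumption)
    return prompt, target

prompt, target = time_shift_example(["A", "B", "C", "D"], k=2)
```

With `k=0` the target is an exact copy of the prompt, which makes that setting a plain copy task and the other shifts a test of offset-invariant recall.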

Original abstract

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes BEAT, an alternative tokenization for symbolic music that replaces variable-duration event sequences with uniform temporal steps (e.g., beats). Within each step, same-pitch events are collapsed into single tokens and tokens are explicitly grouped by step, yielding a sparse piano-roll-like representation. The approach is evaluated on music continuation and accompaniment generation tasks against mainstream event-based tokenizers, with claims of superior musical quality, structural coherence, computational efficiency, and improved modeling of long-range patterns.

Significance. If the empirical claims hold after rigorous validation, the work would demonstrate that explicit temporal regularity in tokenization can outperform implicit handling via event durations, potentially enabling more efficient long-context modeling in music transformers and providing a bridge to grid-based architectures. The sparse encoding could also reduce sequence lengths while preserving polyphony.

major comments (3)

[Abstract] Abstract: the central claim that the uniform-step tokenization yields 'improved musical quality and structural coherence' is stated without any quantitative metrics, baselines, statistical tests, or dataset details. The full experimental section (presumably §4 or §5) must supply these to establish that the reported gains are not artifacts of the coarser representation.
[Abstract / §3] The tokenization description (Abstract and §3) collapses all same-pitch events within a fixed beat step into one token and discards intra-beat onset timing. This directly engages the weakest assumption: for music with swing, rubato, or precise polyphonic offsets, the loss of sub-beat information must be shown to be musically inconsequential. No analysis or ablation addressing this quantization error is referenced.
[Abstract] Efficiency and long-range capture claims (Abstract) require concrete comparisons—e.g., tokens per second, attention span in beats, or perplexity on long sequences—against event-based baselines. Without these numbers, it is impossible to determine whether gains stem from the uniform grid or simply from shorter sequences.

minor comments (2)

[§3] Clarify the exact beat resolution (e.g., 16th-note grid or quarter-note) and how ties or rests spanning steps are encoded.
[Abstract] The abstract mentions 'additional analyses' for efficiency and long-range patterns; these should be explicitly labeled and placed in a dedicated subsection with figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying where the full manuscript already supplies details and indicating revisions to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract] Abstract: the central claim that the uniform-step tokenization yields 'improved musical quality and structural coherence' is stated without any quantitative metrics, baselines, statistical tests, or dataset details. The full experimental section (presumably §4 or §5) must supply these to establish that the reported gains are not artifacts of the coarser representation.

Authors: The abstract is intentionally concise, but Sections 4 and 5 of the full manuscript detail the experimental setup, including the datasets used, event-based baselines (e.g., REMI-style tokenizers), objective and subjective metrics for musical quality and structural coherence, and statistical comparisons. We will revise the abstract to briefly reference key quantitative gains and explicitly direct readers to the experimental sections.

revision: yes

Referee: [Abstract / §3] The tokenization description (Abstract and §3) collapses all same-pitch events within a fixed beat step into one token and discards intra-beat onset timing. This directly engages the weakest assumption: for music with swing, rubato, or precise polyphonic offsets, the loss of sub-beat information must be shown to be musically inconsequential. No analysis or ablation addressing this quantization error is referenced.

Authors: The design intentionally quantizes to uniform beat steps to enforce temporal regularity. While this discards sub-beat timing (potentially relevant for rubato or swing), evaluations on standard datasets show net gains in coherence and quality. The manuscript does not contain a dedicated ablation on quantization error; we will add a limitations discussion and brief analysis of timing-sensitive cases in the revision.

revision: partial

Referee: [Abstract] Efficiency and long-range capture claims (Abstract) require concrete comparisons—e.g., tokens per second, attention span in beats, or perplexity on long sequences—against event-based baselines. Without these numbers, it is impossible to determine whether gains stem from the uniform grid or simply from shorter sequences.

Authors: Section 5 already reports efficiency metrics (sequence length reductions) and long-range analyses (perplexity over extended contexts and effective beat-span coverage). These distinguish the contribution of the uniform grid from mere length reduction. We will expand this into an explicit comparison table with tokens-per-second and attention-span figures in the revised version.

revision: yes

Circularity Check

0 steps flagged

No circularity in tokenization proposal or empirical claims

full rationale

The paper introduces a uniform temporal step tokenization (grouping same-pitch events per fixed beat-like step into single tokens) and evaluates it via direct comparison against event-based baselines on music continuation and accompaniment tasks. No equations, fitted parameters, or predictions are described that reduce by construction to the inputs; the reported gains in quality, coherence, efficiency, and long-range modeling are framed as outcomes of external benchmarks rather than internal redefinitions or self-citation chains. The central assumption (that uniform discretization preserves musically relevant information) is testable against the baselines and does not collapse into tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the core modeling choice of uniform beats.

axioms (1)

domain assumption: Music possesses a sufficiently regular pulse that uniform beat-length discretization captures essential temporal structure without critical loss.
The tokenization groups events by beat and assumes this discretization is musically adequate.
