Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Yoshinori Nomura

arXiv: 2604.21265 · v1 · submitted 2026-04-23 · cs.CL

Pith reviewed 2026-05-09 21:46 UTC · model grok-4.3

keywords: language model pre-training, music pre-training, developmental pipeline, perplexity, transformer, MAESTRO dataset, poetry, small models

The pith

Pre-training on piano music then poetry before prose cuts language model perplexity 17.5 percent versus random initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a staged pre-training sequence on music followed by poetry and then prose helps small transformer models learn language better than starting from scratch. Using piano performances from the MAESTRO dataset, this pipeline produces a 17.5 percent perplexity reduction with statistical significance across five random seeds. Music and poetry appear to strengthen separate parts of the model, and the advantage survives to the final converged performance rather than fading after an initial boost. The work also shows that real music matches the ceiling of synthetic data with far less volume, and that the best amount of pre-training data grows as model size increases.

Core claim

Pre-training a Transformer on piano performances from the MAESTRO dataset, then poetry, then prose, yields a 17.5 percent perplexity improvement over random initialization. Music and poetry improve orthogonal model components, the sequence produces faster convergence to a lower loss plateau in every run, real music reaches the same transfer limit as synthetic patterns with one-third the data, and the relative benefit of larger pre-training sets shifts from negative to positive as dimension grows from 16 to 64.
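
For readers who want the headline number unpacked: perplexity is the exponential of the mean per-token negative log-likelihood, and a figure such as 17.5 percent is a relative reduction against the baseline. The sketch below is illustrative only. The function names are hypothetical and the perplexity values are made up to show the arithmetic; they are not results from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood, in nats per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_reduction(baseline_ppl, candidate_ppl):
    """Fractional perplexity reduction of the candidate versus the baseline."""
    return (baseline_ppl - candidate_ppl) / baseline_ppl


# Illustrative numbers only; they are NOT taken from the paper.
baseline_ppl = 200.0
pipeline_ppl = 165.0
gain = relative_reduction(baseline_ppl, pipeline_ppl)
print(f"relative perplexity reduction: {gain:.1%}")
```

Because the metric is relative, the same percentage can correspond to very different absolute perplexity gaps depending on where the baseline sits.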

What carries the argument

The developmental pipeline that sequences music pre-training before poetry before prose to build model capabilities incrementally.

If this is right

Music pre-training improves internal computation while poetry improves embeddings.
The performance gap persists at convergence rather than being a transient head start.
Real music achieves the same transfer ceiling as synthetic patterns using one-third the data volume.
The optimal volume of pre-training data increases with model capacity from d=16 to d=64.
Structured creative outputs can act as an efficient pre-training substrate for small language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar staged curricula drawn from other creative domains might accelerate learning in additional modalities or tasks.
Mimicking human developmental sequences could become a general design principle for data ordering beyond random shuffling.
At larger scales this approach might lower the total data needed for language models if the orthogonal-component pattern holds.
Perplexity gains on tiny models could be validated on downstream language tasks to confirm they reflect genuine capability.

Load-bearing premise

The reported gains are produced by the specific music-poetry-prose sequence and content rather than by total training compute, dataset statistics, or hyperparameter differences.

What would settle it

An experiment that trains the same models to the same total compute using only prose or random data in place of the full pipeline and finds no 17.5 percent gain, or that repeats the protocol at model dimensions well above 64 and sees the advantage disappear.

Figures

Figures reproduced from arXiv: 2604.21265 by Yoshinori Nomura.

**Figure 2.** Phase 3: Scale × data-size interaction. (a) WikiText-103 validation perplexity at epoch 2 across three model scales. Music pre-training consistently improves over the random baseline at all scales, and the improvement grows with model size. (b) The advantage of MAESTRO-36k over MAESTRO-12k increases monotonically with model capacity (−3.1% → +3.3% → +6.1%), confirming the capacity saturation hypothesis: s…

**Figure 3.** Learning curves across scales. At d = 16, MAESTRO-12k (blue) is the best condition; at d = 32 and d = 64, MAESTRO-36k (red) overtakes 12k, with the gap widening at larger scales. The annotation shows the best music condition and its improvement over the random baseline at epoch 2. Two observations follow. First, adding the poetry phase at d = 32 improves epoch-2 PPL by an additional −8.7% over the music-onl…

Original abstract

We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d\!=\!64$, multi-seed validation (5 seeds) shows a persistent 5.5\% gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d\!=\!16$ to $d\!=\!64$). Across the scales we study ($d\!\in\!\{16,32,64\}$, up to ${\sim}400$K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.
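
The abstract reports p-values over 5 seeds but, as quoted here, does not say which test produced them. As one plausible reading, the sketch below computes a paired t statistic over per-seed perplexities using only the standard library. The choice of a paired t-test is an assumption, the function name is hypothetical, and the per-seed numbers are invented for illustration.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """Return (t, degrees of freedom) for per-seed differences baseline - treatment."""
    diffs = [b - t for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Hypothetical per-seed validation perplexities (5 seeds); NOT from the paper.
random_init = [210.0, 205.0, 212.0, 208.0, 206.0]
pipeline = [172.0, 170.0, 175.0, 171.0, 169.0]

t_stat, dof = paired_t_statistic(random_init, pipeline)

# Two-sided critical value of Student's t with 4 degrees of freedom at alpha = 0.05.
T_CRIT_DF4_ALPHA_05 = 2.776
significant = dof == 4 and abs(t_stat) > T_CRIT_DF4_ALPHA_05
print(f"t = {t_stat:.2f}, dof = {dof}, significant at 0.05: {significant}")
```

Pairing by seed only makes sense if both conditions share the same seeds and data order; if they do not, an unpaired test would be the more defensible choice.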

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Music-then-poetry pre-training beats random init on tiny models by 17.5% perplexity, but total compute matching is not clearly stated.

The paper's main claim is that a music-to-poetry-to-prose sequence, with MAESTRO piano data ahead of standard text, gives small transformers a lasting perplexity edge over random initialization: a 17.5% gain across five seeds, a 5.5% gap still present at plateau for d=64, and some evidence that music and poetry affect separate model components. Scaling runs also show that the best pre-training data volume changes with model size from d=16 to d=64. That developmental ladder and the capacity-dependent data-volume shift are the genuinely new pieces here. The multi-seed statistics and convergence checks are done properly and rule out a pure head-start effect, which is more than many small-scale ablations provide. Real music matching the transfer ceiling of synthetic patterns with one-third the data is a clean side result worth keeping in mind for data-curation work.

The soft spot is straightforward: all models stay at or below roughly 400K parameters, so nothing is shown about whether the pattern survives at modern pre-training scale. More importantly, the abstract never states that the curriculum condition and the random baseline received identical total tokens or optimizer steps. Without that control, the reported gap could simply reflect extra training compute rather than the specific order or content. The orthogonality claim would also need the actual component-wise metrics to evaluate.

This is for people who train or curate data for small language models and want concrete ideas on creative-data pipelines. It has enough empirical grounding and statistical care to deserve a serious referee, even though revisions for larger models and explicit token budgets would be required.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that pre-training small Transformers (dimensions d=16/32/64, up to ~400k parameters) on piano performances from the MAESTRO dataset, followed by poetry and then prose, produces a 17.5% perplexity reduction versus random initialization (p<0.001, 5 seeds), with music and poetry claimed to improve orthogonal components (internal computation and embeddings), faster convergence, and a persistent 5.5% plateau gap (p=0.017). It further reports that real music matches synthetic-pattern transfer with one-third the data and that optimal pre-training data volume shifts with capacity (-3% to +6% advantage).

Significance. If the attribution to the ordered music-poetry-prose sequence holds after proper controls, the work would provide evidence that structured creative outputs can act as an efficient pre-training substrate for small LMs and would support capacity-dependent data-curation principles. The multi-seed statistics and convergence checks are positive features, but the small model regime limits immediate applicability to modern scales.

major comments (3)

[Abstract] The 17.5% perplexity improvement and 5.5% persistent plateau gap are reported relative to random initialization, yet the text supplies no explicit statement that total training tokens, optimizer steps, or data volume were equalized between the music-poetry-prose pipeline and the baseline; without this control the gains could be explained by differential compute budgets rather than the specific sequence.
[Scaling experiments] The reported shifts in optimal pre-training data volume with capacity are presented as evidence for a capacity-dependent principle, but the manuscript does not detail how the data volumes were chosen or whether the pipeline conditions always received the same total tokens as the direct-language baselines.
[Methods / experimental setup] Exact data splits for the MAESTRO, poetry, and prose stages, hyperparameter tables, and confirmation that the poetry/prose stages match the token count of a standard language-only pre-training run are not provided, preventing verification that the reported orthogonality and transfer effects are not artifacts of unmatched training budgets.

minor comments (2)

[Abstract] The parenthetical gloss '(internal computation and embeddings, respectively)' for the orthogonality claim would be clearer if it referenced the precise metric or ablation used to establish orthogonality.
[Abstract] Notation such as 'd=64' and '~400K parameters' could be introduced with a brief parenthetical definition on first use for readers outside the immediate subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our work. We address each of the major comments below, clarifying the experimental controls and providing additional details as requested. We have revised the manuscript to incorporate these clarifications.

Referee: [Abstract] The 17.5% perplexity improvement and 5.5% persistent plateau gap are reported relative to random initialization, yet the text supplies no explicit statement that total training tokens, optimizer steps, or data volume were equalized between the music-poetry-prose pipeline and the baseline; without this control the gains could be explained by differential compute budgets rather than the specific sequence.

Authors: We agree that an explicit statement is necessary to rule out differential compute as an explanation. The original manuscript implied equalized budgets through the experimental design (same model sizes, same optimizer, and total steps matched by adjusting stage lengths), but did not state it clearly in the abstract. We have revised the abstract to include: 'All conditions were trained with identical total tokens and optimizer steps, differing only in data composition.' This confirms the gains are attributable to the sequence.
Revision: yes

Referee: [Scaling experiments] The reported shifts in optimal pre-training data volume with capacity are presented as evidence for a capacity-dependent principle, but the manuscript does not detail how the data volumes were chosen or whether the pipeline conditions always received the same total tokens as the direct-language baselines.

Authors: We acknowledge the need for more detail here. In the revised manuscript, we have expanded the scaling experiments section to explain that data volumes were determined via preliminary sweeps for each model capacity (d=16, 32, 64), and that for each pipeline vs. baseline pair, the total pre-training tokens were strictly equalized by reducing the final prose stage in the ladder to compensate for the music and poetry tokens. The reported advantages (-3% to +6%) are thus under matched compute budgets.
Revision: yes

Referee: [Methods / experimental setup] Exact data splits for the MAESTRO, poetry, and prose stages, hyperparameter tables, and confirmation that the poetry/prose stages match the token count of a standard language-only pre-training run are not provided, preventing verification that the reported orthogonality and transfer effects are not artifacts of unmatched training budgets.

Authors: We thank the referee for pointing this out. The revised Methods section now includes: (1) exact splits and sizes for MAESTRO (e.g., 1000+ performances for training), poetry corpus (specific token counts), and prose; (2) a full hyperparameter table with learning rates, batch sizes, etc.; and (3) explicit confirmation that the combined music+poetry+prose tokens equal the prose-only baseline tokens, with stage-wise breakdowns. These additions allow full verification of the controls.
Revision: yes
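
The matching scheme described in the simulated rebuttal, where the prose stage shrinks to absorb the music and poetry tokens so that every condition sees the same total, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, stage labels, and token counts are all hypothetical, since no actual budgets are given here.

```python
def matched_stage_budgets(total_tokens, music_tokens=0, poetry_tokens=0):
    """Split a fixed token budget across stages; prose receives the remainder."""
    prose_tokens = total_tokens - music_tokens - poetry_tokens
    if prose_tokens <= 0:
        raise ValueError("music + poetry tokens already exhaust the total budget")
    return {"music": music_tokens, "poetry": poetry_tokens, "prose": prose_tokens}


# Hypothetical budget and stage sizes; NOT taken from the paper or the rebuttal.
TOTAL_TOKENS = 1_000_000
pipeline_budget = matched_stage_budgets(
    TOTAL_TOKENS, music_tokens=120_000, poetry_tokens=80_000
)
baseline_budget = matched_stage_budgets(TOTAL_TOKENS)

# Both conditions consume the same number of tokens overall.
budgets_match = sum(pipeline_budget.values()) == sum(baseline_budget.values())
print(pipeline_budget, baseline_budget, budgets_match)
```

Note that under this scheme the pipeline condition sees fewer prose tokens than the baseline, so a matched-budget win would have to come from the earlier stages rather than from extra training.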

Circularity Check

0 steps flagged

No significant circularity: empirical pipeline evaluated on external data

The paper reports experimental results from pre-training Transformers on the external MAESTRO piano dataset followed by poetry and prose, measuring perplexity gains against random initialization with multi-seed p-values and convergence checks. No equations, fitted parameters, or self-citations are shown that reduce the claimed improvements or scaling observations to inputs defined inside the paper. The central claims rest on standard training runs and external benchmarks rather than any self-referential derivation or renaming of results.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Standard Transformer assumptions plus empirical fitting of data volumes per model size; no new entities postulated.

free parameters (2)

model dimension d
Experiments vary d across 16, 32, 64; optimal pre-training data volume is reported to shift with d.
pre-training data volume per scale
Advantage of larger music datasets changes sign across scales, implying per-scale selection or fitting.
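
To make the sign change concrete, the sketch below takes the three advantages quoted in the Figure 2 caption (−3.1%, +3.3%, +6.1%) and picks the preferred music set per scale. Mapping those values to d = 16, 32, 64 in that order follows the abstract's "from d=16 to d=64" wording and is an assumption, as are the function and variable names; a positive value is read as MAESTRO-36k being better.

```python
# Advantage of MAESTRO-36k over MAESTRO-12k, as quoted in the Figure 2 caption.
# The assignment to d = 16, 32, 64 is an assumption based on the abstract.
advantage_36k_over_12k = {16: -0.031, 32: 0.033, 64: 0.061}


def preferred_music_set(advantage):
    """Pick the larger set only when it gives a positive relative advantage."""
    return "MAESTRO-36k" if advantage > 0 else "MAESTRO-12k"


choices = {d: preferred_music_set(a) for d, a in advantage_36k_over_12k.items()}

# Check that the advantage rises strictly with model dimension.
ordered = [advantage_36k_over_12k[d] for d in sorted(advantage_36k_over_12k)]
is_monotonic = all(a < b for a, b in zip(ordered, ordered[1:]))

for d in sorted(choices):
    print(f"d={d}: {choices[d]} ({advantage_36k_over_12k[d]:+.1%})")
print("monotonic in d:", is_monotonic)
```

With only three scales, this shows a trend consistent with the capacity-dependent reading rather than establishing a scaling law.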

axioms (2)

domain assumption: Perplexity on held-out text is a valid proxy for language acquisition
Used throughout as the primary metric.
standard math: Standard Transformer blocks and training dynamics apply without modification
Implicit in all reported runs.
