Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

Natalia Polouliakh; Shogo Noguchi; Shun Minamikawa; Tai Nakamura; Taketo Akama

arxiv: 2603.03190 · v3 · pith:RYQYVXPInew · submitted 2026-03-03 · 💻 cs.AI · q-bio.NC


Shogo Noguchi, Taketo Akama, Tai Nakamura, Shun Minamikawa, Natalia Polouliakh

Pith reviewed 2026-05-21 11:32 UTC · model grok-4.3

classification 💻 cs.AI q-bio.NC

keywords EEG decoding · music identification · ANN representations · expectation modeling · acoustic features · pretraining · brain activity · predictive coding


The pith

Pretraining EEG models on acoustic and expectation ANN representations improves music identification from brain activity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that models pretrained to predict either acoustic or expectation-related ANN representations outperform non-pretrained baselines when identifying music from EEG signals. Combining the two representation types produces complementary gains that surpass performance from ensembles created by varying random initializations. This indicates that the specific type of teacher representation influences how well the model captures cortical activity patterns during music listening. The expectation representation is generated directly from raw audio without manual labels and captures predictive structure beyond basic features such as onset or pitch. The results point to the possibility of guiding representation learning for neural decoding by using principles derived from how the brain encodes music.

Core claim

Models pretrained to predict either acoustic or expectation ANN representations outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. The expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch. This shows that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding.

What carries the argument

Distinct acoustic and expectation ANN representations used as separate teacher targets to pretrain EEG decoding models for music identification.
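As a rough illustration of what "teacher target" pretraining means here, the sketch below fits a linear map from EEG features to a teacher representation by minimizing mean squared error. It is not the paper's method: the linear encoder, the MSE loss, the function name pretrain_linear_encoder, and the synthetic arrays are all assumptions made for illustration, since the actual architectures and losses are not given in the text above.

```python
import numpy as np

def pretrain_linear_encoder(eeg, teacher, lr=0.1, steps=200):
    """Fit W so that eeg @ W approximates the teacher representation (MSE).

    eeg:     (n_samples, d_in) EEG features (hypothetical layout)
    teacher: (n_samples, d_out) ANN representation used as the target
    """
    n, d_in = eeg.shape
    d_out = teacher.shape[1]
    W = np.zeros((d_in, d_out))
    losses = []
    for _ in range(steps):
        err = eeg @ W - teacher                  # prediction error, (n, d_out)
        losses.append(float(np.mean(err ** 2)))  # MSE before this update
        grad = 2.0 * eeg.T @ err / (n * d_out)   # gradient of the MSE w.r.t. W
        W -= lr * grad                           # plain gradient descent step
    return W, losses

# Synthetic stand-ins for EEG features and a teacher representation.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((64, 8))
teacher = eeg @ rng.standard_normal((8, 4)) + 0.1 * rng.standard_normal((64, 4))

W, losses = pretrain_linear_encoder(eeg, teacher)
print(f"MSE at start: {losses[0]:.3f}, at end: {losses[-1]:.3f}")
```

In the workflow described above, one such encoder would be pretrained per teacher type (acoustic or expectation) and then fine-tuned for music identification.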

If this is right

Models pretrained on either representation outperform non-pretrained baselines.
Combining acoustic and expectation targets produces gains beyond those from random seed ensembles.
Teacher representation type shapes downstream EEG decoding performance.
Representation learning for brain signals can be guided by neural encoding principles.
The expectation representation enables investigation of multilayer predictive encoding without manual labels.
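The comparison between a cross-teacher combination and a seed ensemble can be pictured with a small sketch. It assumes that models are combined by averaging their class probabilities, which is one common ensembling rule and not necessarily the one the paper uses; the helper names and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average class probabilities over models, then take the argmax."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return probs.argmax(axis=1)

def accuracy(pred, labels):
    return float(np.mean(pred == labels))

# Toy logits (2 trials, 2 songs), invented for illustration only.
labels = np.array([0, 1])
logits_acoustic = np.array([[2.0, 0.0], [0.5, 0.0]])     # weakly wrong on trial 2
logits_expectation = np.array([[0.0, 0.5], [0.0, 2.0]])  # weakly wrong on trial 1

acc_acoustic = accuracy(logits_acoustic.argmax(axis=1), labels)
acc_expectation = accuracy(logits_expectation.argmax(axis=1), labels)
acc_combined = accuracy(
    ensemble_predict([logits_acoustic, logits_expectation]), labels
)
print(acc_acoustic, acc_expectation, acc_combined)
```

A seed ensemble would pass several models trained on the same teacher with different random initializations to the same ensemble_predict function. The claim summarized above is that mixing teacher types helps more than adding seeds.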

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar pretraining on distinct representation types could be tested for decoding other cognitive processes that involve prediction, such as speech from EEG.
The method may support creation of more general-purpose EEG models that scale to larger and more varied stimulus sets.
Future experiments could check whether these gains hold when the music stimuli or listener groups differ substantially from the training data.

Load-bearing premise

The acoustic and expectation ANN representations distinctly and accurately reflect separable components of cortical activity during music listening.

What would settle it

An experiment showing that pretraining on unrelated or randomly generated targets produces comparable gains in EEG music identification accuracy would indicate the improvements stem from general pretraining rather than the specific content of these representations.

Original abstract

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that pretraining EEG decoding models to predict acoustic and expectation-related representations extracted from ANNs improves music identification from brain activity. Models pretrained on either target outperform non-pretrained baselines, and their combination produces complementary gains that exceed those obtained from strong ensembles created by varying random initializations. The expectation representation is derived directly from raw audio signals without manual labels and is argued to capture predictive structure beyond onsets or pitch, supporting scalable, label-free investigation of multilayer predictive encoding in music cognition.

Significance. If the central empirical claims hold after appropriate controls, the work provides evidence that specific ANN-derived representations aligned with hypothesized cortical components can serve as effective teacher signals for EEG representation learning. This could support development of general-purpose neural decoding models grounded in cortical encoding principles and advance predictive models of music cognition. The label-free computation of the expectation representation and its scalability to diverse datasets are potential strengths for broader applicability.

major comments (1)

Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.
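One way to build the matched-dimensionality control described in this comment is to permute the teacher targets across samples, so that their shape and per-dimension statistics are preserved while their alignment with the EEG is destroyed. The sketch below assumes targets are stored as a (samples, features) array; the function name and the choice to shuffle rows (rather than, say, whole stimuli) are illustrative assumptions.

```python
import numpy as np

def shuffled_control_targets(teacher, seed=0):
    """Return a row-permuted copy of the teacher targets.

    Each feature column keeps exactly the same values (hence the same
    mean and variance), but rows no longer line up with the EEG samples.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(teacher.shape[0])
    return teacher[perm]

# Synthetic stand-in for a teacher representation: 100 samples, 16 features.
rng = np.random.default_rng(1)
teacher = rng.standard_normal((100, 16))
control = shuffled_control_targets(teacher)

print(control.shape)                                      # same shape as teacher
print(np.allclose(control.mean(axis=0), teacher.mean(axis=0)))
```

Pretraining on control with an otherwise identical procedure would then indicate how much of the gain comes from generic auxiliary-task regularization rather than from the content of the representation.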

minor comments (3)

Methods section: Additional details are needed on EEG dataset size, number of subjects, preprocessing steps, exact ANN architectures used for target extraction, and the downstream EEG model architecture to allow full reproducibility and assessment of post-hoc choices.
Figures and results: Performance plots and tables should include error bars, standard deviations, or statistical significance tests to quantify variability across seeds and subjects, as the abstract reports improvements without visible uncertainty measures.
Abstract and introduction: Clarify the precise definition and computation of the 'expectation representation' early on, including how it differs from basic acoustic features like onset or pitch, to strengthen the claim of capturing multilayer predictive encoding.
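The uncertainty reporting requested in the comment on figures and results could be sketched as follows: a mean and standard deviation across seeds, plus an exact paired sign-flip permutation test. The accuracy values are made-up placeholders, not results from the paper, and the choice of test is an assumption; other paired tests would serve equally well.

```python
import itertools
import numpy as np

def summarize(acc):
    """Mean and sample standard deviation (ddof=1) of per-seed accuracies."""
    acc = np.asarray(acc, dtype=float)
    return float(acc.mean()), float(acc.std(ddof=1))

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    count = 0
    total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in itertools.product((1.0, -1.0), repeat=len(d)):
        total += 1
        if abs((d * np.array(signs)).mean()) >= observed - 1e-12:
            count += 1
    return count / total

# Placeholder accuracies for five seeds (NOT from the paper).
pretrained = [0.62, 0.65, 0.61, 0.66, 0.63]
baseline = [0.55, 0.58, 0.57, 0.60, 0.56]

mean_pre, std_pre = summarize(pretrained)
p_value = paired_sign_flip_pvalue(pretrained, baseline)
print(f"pretrained: {mean_pre:.3f} +/- {std_pre:.3f}, p = {p_value:.4f}")
```

With only five paired runs the smallest attainable two-sided p-value is 2/32, which is one reason to report results over more seeds or subjects.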

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our manuscript. We address the concern about control ablations below and will incorporate the suggested experiments in revision.

Point-by-point responses

Referee: Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.

Authors: We agree that additional controls with matched-dimensionality non-semantic signals would help isolate whether gains derive from specific representational alignment rather than generic auxiliary-task effects. Our existing comparisons to seed ensembles (varying only random initializations while holding targets fixed) already demonstrate that the combined acoustic+expectation model exceeds ensemble performance, providing evidence against purely non-specific regularization or initialization benefits. To directly address the referee's point, we will add ablations in the revised manuscript using shuffled versions of the acoustic and expectation features (preserving dimensionality) as well as unrelated audio statistics, with the pretraining procedure held identical. These will be reported alongside the original results to clarify the role of representation type.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical pretraining and evaluation chain is self-contained

Full rationale

The paper describes an empirical workflow of pretraining models to predict acoustic and expectation ANN representations as teacher targets, followed by downstream evaluation on EEG music identification performance. Claims rest on reported outperformance versus non-pretrained baselines and seed ensembles, with no equations, fitted parameters renamed as predictions, or self-referential definitions visible in the provided text. Prior-work citations on ANN-cortical resemblance are external and not load-bearing for the central experimental result. The derivation does not reduce to inputs by construction and remains falsifiable via the described ablations and comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so specific free parameters, axioms, or invented entities cannot be extracted; the work implicitly relies on standard assumptions of ANN representational similarity to cortex and supervised pretraining efficacy.

pith-pipeline@v0.9.0 · 5688 in / 1056 out tokens · 43046 ms · 2026-05-21T11:32:00.882454+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


unclear
Relation between the paper passage and the cited Recognition theorem.

Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 9 internal anchors

[1]

Trends in Cognitive Sciences23(2019) https://doi.org/10.1016/j.tics.2018

Koelsch, S., Vuust, P., Friston, K.: Predictive processes and the peculiar case of music. Trends in Cognitive Sciences23(2019) https://doi.org/10.1016/j.tics.2018. 10.006

work page doi:10.1016/j.tics.2018 2019
[2]

Nature Reviews Neuroscience23(2022) https://doi.org/10.1038/ s41583-022-00578-5

Vuust, P., Heggli, O.A., Friston, K.J., Kringelbach, M.L.: Music in the brain. Nature Reviews Neuroscience23(2022) https://doi.org/10.1038/ s41583-022-00578-5

work page 2022
[3]

MIT Press, Cambridge, MA (2006)

Huron, D.: Sweet Anticipation: Music and the Psychology of Expectation. MIT Press, Cambridge, MA (2006). https://doi.org/10.7551/mitpress/6575.001.0001

work page doi:10.7551/mitpress/6575.001.0001 2006
[4]

Behavioral and Brain Sciences31(2008) https://doi.org/ 10.1017/S0140525X08005293

Juslin, P.N., Västfjäll, D.: Emotional responses to music: The need to consider underlying mechanisms. Behavioral and Brain Sciences31(2008) https://doi.org/ 10.1017/S0140525X08005293

work page doi:10.1017/s0140525x08005293 2008
[5]

Trends in Cognitive Sciences19(2), 86–91 (2015) https://doi.org/10.1016/j.tics.2014.12.001

Salimpoor, V.N., Zald, D.H., Zatorre, R.J., Dagher, A., McIntosh, A.R.: Predictions and the brain: How musical sounds become rewarding. Trends in Cognitive Sciences19(2), 86–91 (2015) https://doi.org/10.1016/j.tics.2014.12.001

work page doi:10.1016/j.tics.2014.12.001 2015
[6]

Music Perception33 (2015) https://doi.org/10.1525/mp.2015.33.1.20

Krumhansl, C.L.: Statistics, structure, and style in music. Music Perception33 (2015) https://doi.org/10.1525/mp.2015.33.1.20

work page doi:10.1525/mp.2015.33.1.20 2015
[7]

Nature Neuroscience6(2003) https://doi.org/10.1038/nn1082 32

Patel, A.D.: Language, music, syntax and the brain. Nature Neuroscience6(2003) https://doi.org/10.1038/nn1082 32

work page doi:10.1038/nn1082 2003
[8]

Masset, R

Rohrmeier, M., Rebuschat, P., Cross, I.: Incidental and online learning of melodic structure. Consciousness and Cognition20(2011) https://doi.org/10.1016/j. concog.2010.07.004

work page doi:10.1016/j 2011
[9]

WIREs Cognitive Science5(1), 105–113 (2014) https://doi.org/10.1002/wcs.1262

Tillmann, B., Poulin-Charronnat, B., Bigand, E.: The role of expectation in music: from the score to emotions and the brain. WIREs Cognitive Science5(1), 105–113 (2014) https://doi.org/10.1002/wcs.1262

work page doi:10.1002/wcs.1262 2014
[10]

Friston, K.: The free-energy principle: a unified brain theory? Nature Reviews Neuroscience11(2010) https://doi.org/10.1038/nrn2787

work page doi:10.1038/nrn2787 2010
[11]

Friston, K.J., Friston, D.A.: A Free Energy Formulation of Music Generation and Perception: Helmholtz Revisited, pp. 43–69. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-319-00107-4_2

work page doi:10.1007/978-3-319-00107-4_2 2013
[12]

Psychophysiology62(2025) https://doi.org/10.1111/psyp.70113

Ishida, K., Nittono, H.: Active inference in music perception: Motor engagement to syncopation modulates rhythmic prediction error. Psychophysiology62(2025) https://doi.org/10.1111/psyp.70113

work page doi:10.1111/psyp.70113 2025
[13]

NeuroImage50(2010) https: //doi.org/10.1016/j.neuroimage.2009.12.019

Pearce, M.T., Ruiz, M.H., Kapasi, S., Wiggins, G.A., Bhattacharya, J.: Unsupervised statistical learning underpins computational, behavioural, and neural manifestations of musical expectation. NeuroImage50(2010) https: //doi.org/10.1016/j.neuroimage.2009.12.019

work page doi:10.1016/j.neuroimage.2009.12.019 2010
[14]

eLife9, 51784 (2020) https://doi.org/10.7554/eLife.51784

Di Liberto, G.M.,et al.: Cortical encoding of melodic expectations in human temporal cortex. eLife9, 51784 (2020) https://doi.org/10.7554/eLife.51784

work page doi:10.7554/elife.51784 2020
[15]

Frontiers in Psychology2, 110 (2011) https://doi.org/10.3389/fpsyg.2011

Koelsch, S.: Toward a neural basis of music perception – a review and updated model. Frontiers in Psychology2, 110 (2011) https://doi.org/10.3389/fpsyg.2011. 00110

work page doi:10.3389/fpsyg.2011 2011
[16]

Brain Research1212(2008) https://doi.org/10.1016/j.brainres.2007.10.078

Koelsch, S., Jentschke, S.: Short-term effects of processing musical syntax: An ERP study. Brain Research1212(2008) https://doi.org/10.1016/j.brainres.2007.10.078

work page doi:10.1016/j.brainres.2007.10.078 2008
[17]

Behavioural Neurology2015, 469508 (2015) https://doi

Yu, X., Liu, T., Gao, D.: The mismatch negativity: An indicator of perception of regularities in music. Behavioural Neurology2015, 469508 (2015) https://doi. org/10.1155/2015/469508

work page doi:10.1155/2015/469508 2015
[18]

Brain Research1117 (2006) https://doi.org/10.1016/j.brainres.2006.08.023

Brattico, E., Tervaniemi, M., Näätänen, R., Peretz, I.: Musical scale properties are automatically processed in the human auditory cortex. Brain Research1117 (2006) https://doi.org/10.1016/j.brainres.2006.08.023

work page doi:10.1016/j.brainres.2006.08.023 2006
[19]

Brain Research1773(2021) https://doi.org/10.1016/j.brainres.2021.147664 33

Mencke, I., Quiroga-Martinez, D.R., Omigie, D., Michalareas, G., Schwarzacher, F., Haumann, N.T., Vuust, P., Brattico, E.: Prediction under uncertainty: Dissociating sensory from cognitive expectations in highly uncertain musical contexts. Brain Research1773(2021) https://doi.org/10.1016/j.brainres.2021.147664 33

work page doi:10.1016/j.brainres.2021.147664 2021
[20]

NeuroImage215(2020) https://doi.org/10.1016/j.neuroimage.2020.116816

Quiroga-Martinez, D.R., Hansen, N.C., Højlund, A., Pearce, M., Brattico, E., Vuust, P.: Decomposing neural responses to melodic surprise in musicians and non-musicians: Evidence for a hierarchy of predictions in the auditory system. NeuroImage215(2020) https://doi.org/10.1016/j.neuroimage.2020.116816

work page doi:10.1016/j.neuroimage.2020.116816 2020
[21]

PLOS ONE3, 2631 (2008) https://doi.org/10.1371/journal.pone.0002631

Koelsch, S., Kilches, S., Steinbeis, N., Schelinski, S.: Effects of unexpected chords and of performer’s expression on brain responses and electrodermal activity. PLOS ONE3, 2631 (2008) https://doi.org/10.1371/journal.pone.0002631

work page doi:10.1371/journal.pone.0002631 2008
[22]

Cortex49(2013) https://doi.org/10.1016/j.cortex.2012.08.024

Carrus, E., Pearce, M.T., Bhattacharya, J.: Melodic pitch expectation interacts with neural responses to syntactic but not semantic violations. Cortex49(2013) https://doi.org/10.1016/j.cortex.2012.08.024

work page doi:10.1016/j.cortex.2012.08.024 2013
[23]

Journal of Cognitive Neuroscience18(2006) https://doi.org/10.1162/ jocn.2006.18.8.1380

Steinbeis, N., Koelsch, S., Sloboda, J.A.: The role of harmonic expectancy violations in musical emotions: Evidence from subjective, physiological, and neural responses. Journal of Cognitive Neuroscience18(2006) https://doi.org/10.1162/ jocn.2006.18.8.1380

work page 2006
[24]

Neuropsychologia51(9), 1749–1762 (2013) https://doi.org/10.1016/j.neuropsychologia.2013.05.010

Omigie, D., Pearce, M.T., Williamson, V.J., Stewart, L.: Electrophysiological correlates of melodic processing in congenital amusia. Neuropsychologia51(9), 1749–1762 (2013) https://doi.org/10.1016/j.neuropsychologia.2013.05.010

work page doi:10.1016/j.neuropsychologia.2013.05.010 2013
[25]

Frontiers in Human Neuroscience8(2014) https://doi.org/10.3389/fnhum.2014.00988

Choi, I., Bharadwaj, H.M., Bressler, S., Loui, P., Lee, K., Shinn-Cunningham, B.G.: Automatic processing of abstract musical tonality. Frontiers in Human Neuroscience8(2014) https://doi.org/10.3389/fnhum.2014.00988

work page doi:10.3389/fnhum.2014.00988 2014
[26]

Journal of the American Academy of Audiology30(6), 451–458 (2019) https://doi.org/10.3766/jaaa.17047

Heacock, R.M., Pigeon, A., Chermak, G., Musiek, F., Weihing, J.: Enhancement of the auditory late response (N1-P2) by presentation of stimuli from an unexpected location. Journal of the American Academy of Audiology30(6), 451–458 (2019) https://doi.org/10.3766/jaaa.17047

work page doi:10.3766/jaaa.17047 2019
[27]

NeuroImage38(2), 331–345 (2007) https://doi.org/10.1016/j.neuroimage.2007.07.034

Miranda, R.A., Ullman, M.T.: Double dissociation between rules and memory in music: An event-related potential study. NeuroImage38(2), 331–345 (2007) https://doi.org/10.1016/j.neuroimage.2007.07.034

work page doi:10.1016/j.neuroimage.2007.07.034 2007
[28]

Journal of Cognitive Neuroscience31(2019) https://doi.org/10.1162/jocn_a_01388

Omigie, D., Pearce, M., Lehongre, K., Hasboun, D., Navarro, V., Adam, C., Samson,S.:Intracranialrecordingsandcomputationalmodelingofmusicrevealthe time course of prediction error signaling in frontal and temporal cortices. Journal of Cognitive Neuroscience31(2019) https://doi.org/10.1162/jocn_a_01388

work page doi:10.1162/jocn_a_01388 2019
[29]

European Journal of Neuroscience49(2019) https: //doi.org/10.1111/ejn.14329

Lumaca, M., Trusbak Haumann, N., Brattico, E., Grube, M., Vuust, P.: Weighting of neural prediction error by rhythmic complexity: A predictive coding account using mismatch negativity. European Journal of Neuroscience49(2019) https: //doi.org/10.1111/ejn.14329

work page doi:10.1111/ejn.14329 2019
[30]

Scientific Reports14(2024) https://doi

Ono, K., Mizuochi, R., Yamamoto, K., Sasaoka, T., Yamawaki, S.: Exploring the neural underpinnings of chord prediction uncertainty: an 34 electroencephalography (EEG) study. Scientific Reports14(2024) https://doi. org/10.1038/s41598-024-55366-1

work page doi:10.1038/s41598-024-55366-1 2024
[31]

International Journal of Psychophysiology139(2019) https://doi.org/10.1016/j.ijpsycho.2019.03.009

Tanovic, E., Joormann, J.: Anticipating the unknown: The stimulus-preceding negativity is enhanced by uncertain threat. International Journal of Psychophysiology139(2019) https://doi.org/10.1016/j.ijpsycho.2019.03.009

work page doi:10.1016/j.ijpsycho.2019.03.009 2019
[32]

eLife12, 80935 (2023) https://doi.org/10.7554/eLife.80935

Kern, P., Heilbron, M., Lange, F.P., Spaak, E.: Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience. eLife12, 80935 (2023) https://doi.org/10.7554/eLife.80935

work page doi:10.7554/elife.80935 2023
[33]

European Journal of Neuroscience60(11), 6734–6749 (2024) https://doi.org/10.1111/ejn.16581

Galeano-Otálvaro, J.-D., Martorell, J., Meyer, L., Titone, L.: Neural encoding of melodic expectations in music across EEG frequency bands. European Journal of Neuroscience60(11), 6734–6749 (2024) https://doi.org/10.1111/ejn.16581

work page doi:10.1111/ejn.16581 2024
[34]

Nature Communications16, 8874 (2025) https://doi.org/10.1038/s41467-025-63961-7

Mischler, G., Li, Y.A., Bickel, S., Mehta, A.D., Mesgarani, N.: The impact of musical expertise on disentangled and contextual neural encoding of music revealed by generative music models. Nature Communications16, 8874 (2025) https://doi.org/10.1038/s41467-025-63961-7

work page doi:10.1038/s41467-025-63961-7 2025
[35]

PLOS Biology21, 3002176 (2023) https://doi.org/10

Bellier, L., Llorens, A., Marciano, D., Gunduz, A., Schalk, G., Brunner, P., Knight, R.T.: Music can be reconstructed from human auditory cortex activity using nonlinear decoding models. PLOS Biology21, 3002176 (2023) https://doi.org/10. 1371/journal.pbio.3002176

work page 2023
[36]

PLOS Biology21(2023) https://doi

Tuckute, G., Feather, J., Boebinger, D., McDermott, J.H.: Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions. PLOS Biology21(2023) https://doi. org/10.1371/journal.pbio.3002366

work page doi:10.1371/journal.pbio.3002366 2023
[37]

Scientific Reports15, 18869 (2025) https://doi.org/10.1038/s41598-025-02790-6

Akama, T., Zhang, Z., Li, P., Hongo, K., Minamikawa, S., Polouliakh, N.: Predicting artificial neural network representations to learn recognition model for music identification from brain recordings. Scientific Reports15, 18869 (2025) https://doi.org/10.1038/s41598-025-02790-6

work page doi:10.1038/s41598-025-02790-6 2025
[38]

Scientific Reports13(2023) https://doi.org/10.1038/s41598-022-27361-x

Daly, I.: Neural decoding of music from the EEG. Scientific Reports13(2023) https://doi.org/10.1038/s41598-022-27361-x

work page doi:10.1038/s41598-022-27361-x 2023
[39]

Hu- man Brain Mapping (aug 2017)

Schirrmeister, R.T., Springenberg, J.T., Fiederer, L.D.J., Glasstetter, M., Eggensperger, K., Tangermann, M., Hutter, F., Burgard, W., Ball, T.: Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping38(2017) https://doi.org/10.1002/hbm.23730

work page doi:10.1002/hbm.23730 2017
[40]

Nature Machine Intelligence5 (2023) https://doi.org/10.1038/s42256-023-00714-5 35

Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., King, J.-R.: Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence5 (2023) https://doi.org/10.1038/s42256-023-00714-5 35

work page doi:10.1038/s42256-023-00714-5 2023
[41]

In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (2017)

Losorelli, S., Nguyen, D.T., Dmochowski, J.P., Kaneshiro, B.: NMED-T: A tempo-focused dataset of cortical and behavioral responses to naturalistic music. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (2017). https://exhibits.stanford.edu/data/catalog/ jn859kj8079

work page 2017
[42]

Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, and Xie Chen

Zhu, H., Zhou, Y., Chen, H., Yu, J., Ma, Z., Gu, R., Luo, Y., Tan, W., Chen, X.: MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization (2025). https://doi.org/10.48550/arXiv.2501.01108 . https: //arxiv.org/abs/2501.01108

work page doi:10.48550/arxiv.2501.01108 2025
[43]

Strongly Recommend Advancing

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., Défossez, A.: Simple and controllable music generation. In: Advances in Neural Information Processing Systems (2023). https://doi.org/10.48550/arXiv.2306.05284

work page doi:10.48550/arxiv.2306.05284 2023
[44]

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles (2017). https://doi.org/10.48550/ arXiv.1612.01474 . https://arxiv.org/abs/1612.01474

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

Deep ensembles: A loss landscape perspective.arXiv preprint arXiv:1912.02757,

Fort, S., Hu, H., Lakshminarayanan, B.: Deep Ensembles: A Loss Landscape Perspective (2019). https://doi.org/10.48550/arXiv.1912.02757 . https://arxiv. org/abs/1912.02757

work page doi:10.48550/arxiv.1912.02757 2019
[46]

High Fidelity Neural Audio Compression

Défossez, A., Copet, J., Synnaeve, G., Adi, Y.: High Fidelity Neural Audio Compression (2022). https://doi.org/10.48550/arXiv.2210.13438 . https://arxiv. org/abs/2210.13438

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.13438 2022
[47]

In: Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States (2019)

Bogdanov, D., Won, M., Tovstogan, P., Porter, A., Serra, X.: The MTG-Jamendo Dataset for Automatic Music Tagging. In: Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States (2019). https://doi.org/10.5281/zenodo.3826813

work page doi:10.5281/zenodo.3826813 2019
[48]

In: Advances in Neural Information Processing Systems, vol

Millet, J., Caucheteux, C., Orhan, P., Boubenec, Y., Gramfort, A., Dunbar, E., Pallier, C., King, J.-R.: Toward a realistic model of speech processing in the brain with self-supervised learning. In: Advances in Neural Information Processing Systems, vol. 35, pp. 33428–33443 (2022). https://proceedings.neurips.cc/paper_ files/paper/2022/file/d81ecfc8fb18e8...

work page 2022
[49]

https://doi.org/10.48550/ arXiv.2205.14252

Vaidya, A.R., Jain, S., Huth, A.G.: Self-supervised models of audio effectively explain human cortical responses to speech (2022). https://doi.org/10.48550/ arXiv.2205.14252 . https://arxiv.org/abs/2205.14252

work page arXiv 2022
[50]

In: International Conference on Learning Representations (ICLR) (2019)

Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music transformer: Generating music with long-term structure. In: International Conference on Learning Representations (ICLR) (2019). https://doi.org/10.48550/arXiv.1809. 04281 36

work page doi:10.48550/arxiv.1809 2019
[51]

Audiolm: a language modeling approach to audio generation, 2023

Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Roblek, D., Teboul, O., Grangier, D., Tagliasacchi, M., Zeghidour, N.: AudioLM: a language modeling approach to audio generation. arXiv preprint (2022) https: //doi.org/10.48550/arXiv.2209.03143 arXiv:2209.03143

work page doi:10.48550/arxiv.2209.03143 2022
[52]

Jukebox: A Generative Model for Music

Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020) https: //doi.org/10.48550/arXiv.2005.00341

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.00341 2005
[53]

MusicLM: Generating Music From Text

Agostinelli, A., Denk, T.I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., Sharifi, M., Zeghidour, N., Frank, C.: Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325 (2023) https://doi.org/10.48550/arXiv.2301.11325

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2301.11325 2023
[54]

Disentangled representation learning, 2024

Wang, X., Chen, H., Tang, S., Wu, Z., Zhu, W.: Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) https://doi.org/10.48550/arXiv.2211.11695

work page doi:10.48550/arxiv.2211.11695 2024
[55]

https://doi.org/10.48550/arXiv.2205.14540

Liang, F., Li, Y., Marculescu, D.: SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners (2024). https://doi.org/10.48550/arXiv.2205.14540 . https://arxiv.org/abs/2205.14540

work page doi:10.48550/arxiv.2205.14540 2024
[56]

Journal of Machine Learning Research12, 2825–2830 (2011)

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V.,et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research12, 2825–2830 (2011)

work page 2011
[57]

In: ICASSP 2023 - 2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Oota, S.R., Pahwa, K., Marreddy, M., Gupta, M., Raju, B.S.: Neural architecture of speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023). https://doi.org/10.1109/ICASSP49357.2023. 10096248

work page doi:10.1109/icassp49357.2023 2023
[58] Oota, S.R., Chen, Z., Gupta, M., Bapi, R.S., Jobard, G., Alexandre, F., Hinaut, X.: Deep Neural Networks and Brain Alignment: Brain Encoding and Decoding (Survey) (2024). https://doi.org/10.48550/arXiv.2307.10246 . https://arxiv.org/abs/2307.10246
[59] Li, Y., Yuan, R., Zhang, G., Ma, Y., Chen, X., Yin, H., Xiao, C., Lin, C., Ragni, A., Benetos, E., Gyenge, N., Dannenberg, R., Liu, R., Chen, W., Xia, G., Shi, Y., Huang, W., Wang, Z., Guo, Y., Fu, J.: MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training (2024). https://doi.org/10.48550/arXiv.2306.00107 . https://arxiv.org/a...
[60] Won, M., Hung, Y.-N., Le, D.: A Foundation Model for Music Informatics (2023). https://doi.org/10.48550/arXiv.2311.03318 . https://arxiv.org/abs/2311.03318
[61] Yuan, R., Ma, Y., Li, Y., Zhang, G., Chen, X., Yin, H., Zhuo, L., Liu, Y., Huang, J., Tian, Z., Deng, B., Wang, N., Lin, C., Benetos, E., Ragni, A., Gyenge, N., Dannenberg, R., Chen, W., Xia, G., Xue, W., Liu, S., Wang, S., Liu, R., Guo, Y., Fu, J.: MARBLE: Music Audio Representation Benchmark for Universal Evaluation (2023). https://doi.org/10.48550/a...
[62] Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J.Y., Ellis, D.P.W.: MuLan: A Joint Embedding of Music Audio and Natural Language (2022). https://doi.org/10.48550/arXiv.2208.12415 . https://arxiv.org/abs/2208.12415
[63] Elizalde, B., Deshmukh, S., Wang, H.: Natural Language Supervision for General-Purpose Audio Representations (2024). https://doi.org/10.48550/arXiv.2309.05767 . https://arxiv.org/abs/2309.05767
[64] McFee, B., et al.: Librosa 0.11.0. https://doi.org/10.5281/zenodo.15006942
[65] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...
[66] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving Language Understanding by Generative Pre-Training. OpenAI technical report (2018). https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[67] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language Models are Unsupervised Multitask Learners. OpenAI technical report (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[68] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked Autoencoders Are Scalable Vision Learners (2021). https://doi.org/10.48550/arXiv.2111.06377 . https://arxiv.org/abs/2111.06377
[69] Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT Pre-Training of Image Transformers (2022). https://doi.org/10.48550/arXiv.2106.08254 . https://arxiv.org/abs/2106.08254
[70] Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: SimMIM: A simple framework for masked image modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022). https://doi.org/10.48550/arXiv.2111.09886
[71] Jiang, W.-B., Zhao, L.-M., Lu, B.-L.: Large brain model for learning generic representations with tremendous EEG data in BCI. In: International Conference on Learning Representations (ICLR) (2024). https://openreview.net/group?id=ICLR.cc/2024/Conference
[72] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision (2021). https://doi.org/10.48550/arXiv.2103.00020 . https://arxiv.org/abs/2103.00020
[73] Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C.: Masked Autoencoders that Listen (2023). https://doi.org/10.48550/arXiv.2207.06405 . https://arxiv.org/abs/2207.06405
[74] Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization (2017). https://doi.org/10.48550/arXiv.1412.6980 . https://arxiv.org/abs/1412.6980

Supplementary information

Supplementary Note 1: Algorithmic details of MuQ extraction

In Algorithm 1, the input waveform is transformed explicitly as x → x1 → x2 → x3. In Algorithm 2, K-means is run with K-means...

[1] Koelsch, S., Vuust, P., Friston, K.: Predictive processes and the peculiar case of music. Trends in Cognitive Sciences 23 (2019) https://doi.org/10.1016/j.tics.2018.10.006

[2] Vuust, P., Heggli, O.A., Friston, K.J., Kringelbach, M.L.: Music in the brain. Nature Reviews Neuroscience 23 (2022) https://doi.org/10.1038/s41583-022-00578-5

[3] Huron, D.: Sweet Anticipation: Music and the Psychology of Expectation. MIT Press, Cambridge, MA (2006). https://doi.org/10.7551/mitpress/6575.001.0001

[4] Juslin, P.N., Västfjäll, D.: Emotional responses to music: The need to consider underlying mechanisms. Behavioral and Brain Sciences 31 (2008) https://doi.org/10.1017/S0140525X08005293

[5] Salimpoor, V.N., Zald, D.H., Zatorre, R.J., Dagher, A., McIntosh, A.R.: Predictions and the brain: How musical sounds become rewarding. Trends in Cognitive Sciences 19(2), 86–91 (2015) https://doi.org/10.1016/j.tics.2014.12.001

[6] Krumhansl, C.L.: Statistics, structure, and style in music. Music Perception 33 (2015) https://doi.org/10.1525/mp.2015.33.1.20

[7] Patel, A.D.: Language, music, syntax and the brain. Nature Neuroscience 6 (2003) https://doi.org/10.1038/nn1082

[8] Rohrmeier, M., Rebuschat, P., Cross, I.: Incidental and online learning of melodic structure. Consciousness and Cognition 20 (2011) https://doi.org/10.1016/j.concog.2010.07.004

[9] Tillmann, B., Poulin-Charronnat, B., Bigand, E.: The role of expectation in music: from the score to emotions and the brain. WIREs Cognitive Science 5(1), 105–113 (2014) https://doi.org/10.1002/wcs.1262

[10] Friston, K.: The free-energy principle: a unified brain theory? Nature Reviews Neuroscience 11 (2010) https://doi.org/10.1038/nrn2787

[11] Friston, K.J., Friston, D.A.: A Free Energy Formulation of Music Generation and Perception: Helmholtz Revisited, pp. 43–69. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-319-00107-4_2

[12] Ishida, K., Nittono, H.: Active inference in music perception: Motor engagement to syncopation modulates rhythmic prediction error. Psychophysiology 62 (2025) https://doi.org/10.1111/psyp.70113

[13] Pearce, M.T., Ruiz, M.H., Kapasi, S., Wiggins, G.A., Bhattacharya, J.: Unsupervised statistical learning underpins computational, behavioural, and neural manifestations of musical expectation. NeuroImage 50 (2010) https://doi.org/10.1016/j.neuroimage.2009.12.019

[14] Di Liberto, G.M., et al.: Cortical encoding of melodic expectations in human temporal cortex. eLife 9, 51784 (2020) https://doi.org/10.7554/eLife.51784

[15] Koelsch, S.: Toward a neural basis of music perception – a review and updated model. Frontiers in Psychology 2, 110 (2011) https://doi.org/10.3389/fpsyg.2011.00110

[16] Koelsch, S., Jentschke, S.: Short-term effects of processing musical syntax: An ERP study. Brain Research 1212 (2008) https://doi.org/10.1016/j.brainres.2007.10.078

[17] Yu, X., Liu, T., Gao, D.: The mismatch negativity: An indicator of perception of regularities in music. Behavioural Neurology 2015, 469508 (2015) https://doi.org/10.1155/2015/469508

[18] Brattico, E., Tervaniemi, M., Näätänen, R., Peretz, I.: Musical scale properties are automatically processed in the human auditory cortex. Brain Research 1117 (2006) https://doi.org/10.1016/j.brainres.2006.08.023

[19] Mencke, I., Quiroga-Martinez, D.R., Omigie, D., Michalareas, G., Schwarzacher, F., Haumann, N.T., Vuust, P., Brattico, E.: Prediction under uncertainty: Dissociating sensory from cognitive expectations in highly uncertain musical contexts. Brain Research 1773 (2021) https://doi.org/10.1016/j.brainres.2021.147664

[20] Quiroga-Martinez, D.R., Hansen, N.C., Højlund, A., Pearce, M., Brattico, E., Vuust, P.: Decomposing neural responses to melodic surprise in musicians and non-musicians: Evidence for a hierarchy of predictions in the auditory system. NeuroImage 215 (2020) https://doi.org/10.1016/j.neuroimage.2020.116816

[21] Koelsch, S., Kilches, S., Steinbeis, N., Schelinski, S.: Effects of unexpected chords and of performer’s expression on brain responses and electrodermal activity. PLOS ONE 3, 2631 (2008) https://doi.org/10.1371/journal.pone.0002631

[22] Carrus, E., Pearce, M.T., Bhattacharya, J.: Melodic pitch expectation interacts with neural responses to syntactic but not semantic violations. Cortex 49 (2013) https://doi.org/10.1016/j.cortex.2012.08.024

[23] Steinbeis, N., Koelsch, S., Sloboda, J.A.: The role of harmonic expectancy violations in musical emotions: Evidence from subjective, physiological, and neural responses. Journal of Cognitive Neuroscience 18 (2006) https://doi.org/10.1162/jocn.2006.18.8.1380

[24] Omigie, D., Pearce, M.T., Williamson, V.J., Stewart, L.: Electrophysiological correlates of melodic processing in congenital amusia. Neuropsychologia 51(9), 1749–1762 (2013) https://doi.org/10.1016/j.neuropsychologia.2013.05.010

[25] Choi, I., Bharadwaj, H.M., Bressler, S., Loui, P., Lee, K., Shinn-Cunningham, B.G.: Automatic processing of abstract musical tonality. Frontiers in Human Neuroscience 8 (2014) https://doi.org/10.3389/fnhum.2014.00988

[26] Heacock, R.M., Pigeon, A., Chermak, G., Musiek, F., Weihing, J.: Enhancement of the auditory late response (N1-P2) by presentation of stimuli from an unexpected location. Journal of the American Academy of Audiology 30(6), 451–458 (2019) https://doi.org/10.3766/jaaa.17047

[27] Miranda, R.A., Ullman, M.T.: Double dissociation between rules and memory in music: An event-related potential study. NeuroImage 38(2), 331–345 (2007) https://doi.org/10.1016/j.neuroimage.2007.07.034

[28] Omigie, D., Pearce, M., Lehongre, K., Hasboun, D., Navarro, V., Adam, C., Samson, S.: Intracranial recordings and computational modeling of music reveal the time course of prediction error signaling in frontal and temporal cortices. Journal of Cognitive Neuroscience 31 (2019) https://doi.org/10.1162/jocn_a_01388

[29] Lumaca, M., Trusbak Haumann, N., Brattico, E., Grube, M., Vuust, P.: Weighting of neural prediction error by rhythmic complexity: A predictive coding account using mismatch negativity. European Journal of Neuroscience 49 (2019) https://doi.org/10.1111/ejn.14329

[30] Ono, K., Mizuochi, R., Yamamoto, K., Sasaoka, T., Yamawaki, S.: Exploring the neural underpinnings of chord prediction uncertainty: an electroencephalography (EEG) study. Scientific Reports 14 (2024) https://doi.org/10.1038/s41598-024-55366-1

[31] Tanovic, E., Joormann, J.: Anticipating the unknown: The stimulus-preceding negativity is enhanced by uncertain threat. International Journal of Psychophysiology 139 (2019) https://doi.org/10.1016/j.ijpsycho.2019.03.009

[32] Kern, P., Heilbron, M., Lange, F.P., Spaak, E.: Cortical activity during naturalistic music listening reflects short-range predictions based on long-term experience. eLife 12, 80935 (2023) https://doi.org/10.7554/eLife.80935

[33] Galeano-Otálvaro, J.-D., Martorell, J., Meyer, L., Titone, L.: Neural encoding of melodic expectations in music across EEG frequency bands. European Journal of Neuroscience 60(11), 6734–6749 (2024) https://doi.org/10.1111/ejn.16581

[34] Mischler, G., Li, Y.A., Bickel, S., Mehta, A.D., Mesgarani, N.: The impact of musical expertise on disentangled and contextual neural encoding of music revealed by generative music models. Nature Communications 16, 8874 (2025) https://doi.org/10.1038/s41467-025-63961-7

[35] Bellier, L., Llorens, A., Marciano, D., Gunduz, A., Schalk, G., Brunner, P., Knight, R.T.: Music can be reconstructed from human auditory cortex activity using nonlinear decoding models. PLOS Biology 21, 3002176 (2023) https://doi.org/10.1371/journal.pbio.3002176

[36] Tuckute, G., Feather, J., Boebinger, D., McDermott, J.H.: Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions. PLOS Biology 21 (2023) https://doi.org/10.1371/journal.pbio.3002366

[37] Akama, T., Zhang, Z., Li, P., Hongo, K., Minamikawa, S., Polouliakh, N.: Predicting artificial neural network representations to learn recognition model for music identification from brain recordings. Scientific Reports 15, 18869 (2025) https://doi.org/10.1038/s41598-025-02790-6

[38] Daly, I.: Neural decoding of music from the EEG. Scientific Reports 13 (2023) https://doi.org/10.1038/s41598-022-27361-x

[39] Schirrmeister, R.T., Springenberg, J.T., Fiederer, L.D.J., Glasstetter, M., Eggensperger, K., Tangermann, M., Hutter, F., Burgard, W., Ball, T.: Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping 38 (2017) https://doi.org/10.1002/hbm.23730

[40] Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., King, J.-R.: Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence 5 (2023) https://doi.org/10.1038/s42256-023-00714-5

[41] Losorelli, S., Nguyen, D.T., Dmochowski, J.P., Kaneshiro, B.: NMED-T: A tempo-focused dataset of cortical and behavioral responses to naturalistic music. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) (2017). https://exhibits.stanford.edu/data/catalog/jn859kj8079

[42] Zhu, H., Zhou, Y., Chen, H., Yu, J., Ma, Z., Gu, R., Luo, Y., Tan, W., Chen, X.: MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization (2025). https://doi.org/10.48550/arXiv.2501.01108 . https://arxiv.org/abs/2501.01108

[43] Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., Défossez, A.: Simple and controllable music generation. In: Advances in Neural Information Processing Systems (2023). https://doi.org/10.48550/arXiv.2306.05284

[44] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles (2017). https://doi.org/10.48550/arXiv.1612.01474 . https://arxiv.org/abs/1612.01474

[45] Fort, S., Hu, H., Lakshminarayanan, B.: Deep Ensembles: A Loss Landscape Perspective (2019). https://doi.org/10.48550/arXiv.1912.02757 . https://arxiv.org/abs/1912.02757

[46] Défossez, A., Copet, J., Synnaeve, G., Adi, Y.: High Fidelity Neural Audio Compression (2022). https://doi.org/10.48550/arXiv.2210.13438 . https://arxiv.org/abs/2210.13438

[47] Bogdanov, D., Won, M., Tovstogan, P., Porter, A., Serra, X.: The MTG-Jamendo Dataset for Automatic Music Tagging. In: Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States (2019). https://doi.org/10.5281/zenodo.3826813

[48] Millet, J., Caucheteux, C., Orhan, P., Boubenec, Y., Gramfort, A., Dunbar, E., Pallier, C., King, J.-R.: Toward a realistic model of speech processing in the brain with self-supervised learning. In: Advances in Neural Information Processing Systems, vol. 35, pp. 33428–33443 (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/d81ecfc8fb18e8...

[49] Vaidya, A.R., Jain, S., Huth, A.G.: Self-supervised models of audio effectively explain human cortical responses to speech (2022). https://doi.org/10.48550/arXiv.2205.14252 . https://arxiv.org/abs/2205.14252

[50] Huang, C.-Z.A., Vaswani, A., Uszkoreit, J., Shazeer, N., Simon, I., Hawthorne, C., Dai, A.M., Hoffman, M.D., Dinculescu, M., Eck, D.: Music transformer: Generating music with long-term structure. In: International Conference on Learning Representations (ICLR) (2019). https://doi.org/10.48550/arXiv.1809.04281

[51] Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Roblek, D., Teboul, O., Grangier, D., Tagliasacchi, M., Zeghidour, N.: AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143 (2022) https://doi.org/10.48550/arXiv.2209.03143

[52] Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020) https://doi.org/10.48550/arXiv.2005.00341
