arxiv: 2603.03190 · v3 · pith:RYQYVXPInew · submitted 2026-03-03 · 💻 cs.AI · q-bio.NC

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

Pith reviewed 2026-05-21 11:32 UTC · model grok-4.3

classification 💻 cs.AI q-bio.NC
keywords EEG decoding · music identification · ANN representations · expectation modeling · acoustic features · pretraining · brain activity · predictive coding

The pith

Pretraining EEG models on acoustic and expectation ANN representations improves music identification from brain activity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that models pretrained to predict either acoustic or expectation-related ANN representations outperform non-pretrained baselines when identifying music from EEG signals. Combining the two representation types produces complementary gains that surpass performance from ensembles created by varying random initializations. This indicates that the specific type of teacher representation influences how well the model captures cortical activity patterns during music listening. The expectation representation is generated directly from raw audio without manual labels and captures predictive structure beyond basic features such as onset or pitch. The results point to the possibility of guiding representation learning for neural decoding by using principles derived from how the brain encodes music.

Core claim

Models pretrained to predict either acoustic or expectation ANN representations outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. The expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch. This shows that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding.

What carries the argument

Distinct acoustic and expectation ANN representations used as separate teacher targets to pretrain EEG decoding models for music identification.
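The teacher-target idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a linear encoder, a mean-squared-error objective, and synthetic data, and the names `pretrain`, `acoustic_teacher`, and `expectation_teacher` are hypothetical. Each encoder is trained separately to predict one type of teacher representation from the same EEG features.

```python
import random


def predict(x, w):
    """Linear readout: one EEG feature vector times weight matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def mse(eeg, teacher, w):
    """Mean over trials of the squared error summed over target dimensions."""
    total = 0.0
    for x, t in zip(eeg, teacher):
        total += sum((p - tj) ** 2 for p, tj in zip(predict(x, w), t))
    return total / len(eeg)


def pretrain(eeg, teacher, lr=0.05, epochs=200, seed=0):
    """Gradient descent on the MSE between predicted and teacher vectors."""
    rng = random.Random(seed)
    d_in, d_out, n = len(eeg[0]), len(teacher[0]), len(eeg)
    w = [[rng.gauss(0.0, 0.01) for _ in range(d_out)] for _ in range(d_in)]
    for _ in range(epochs):
        grad = [[0.0] * d_out for _ in range(d_in)]
        for x, t in zip(eeg, teacher):
            err = [p - tj for p, tj in zip(predict(x, w), t)]
            for i in range(d_in):
                for j in range(d_out):
                    grad[i][j] += 2.0 * err[j] * x[i] / n
        for i in range(d_in):
            for j in range(d_out):
                w[i][j] -= lr * grad[i][j]
    return w


# Synthetic stand-ins (assumption): 64 trials, 4 EEG features, 3-dim teachers.
data_rng = random.Random(1)
eeg = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]


def make_teacher():
    """Hypothetical teacher representation: a fixed random linear map of the EEG."""
    true_w = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
    return [predict(x, true_w) for x in eeg]


acoustic_teacher = make_teacher()
expectation_teacher = make_teacher()

zero_w = [[0.0] * 3 for _ in range(4)]
w_acoustic = pretrain(eeg, acoustic_teacher)
w_expectation = pretrain(eeg, expectation_teacher)
```

In a real pipeline the pretrained weights would initialize an EEG model that is then fine-tuned for music identification; that downstream step is omitted here.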

If this is right

  • Models pretrained on either representation outperform non-pretrained baselines.
  • Combining acoustic and expectation targets produces gains beyond those from random seed ensembles.
  • Teacher representation type shapes downstream EEG decoding performance.
  • Representation learning for brain signals can be guided by neural encoding principles.
  • The expectation representation enables investigation of multilayer predictive encoding without manual labels.
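How two differently pretrained models can complement each other is easiest to see with a toy case. The sketch below assumes the combination is done by averaging predicted class probabilities, which the text here does not confirm; the probability tables are made-up placeholders, not results from the paper. A seed ensemble would use the same `average_probs` call on models that differ only in random initialization.

```python
def average_probs(member_probs):
    """Average per-trial class probabilities over ensemble members."""
    n_models = len(member_probs)
    n_trials = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    return [
        [sum(m[t][c] for m in member_probs) / n_models for c in range(n_classes)]
        for t in range(n_trials)
    ]


def accuracy(probs, labels):
    """Fraction of trials whose highest-probability class matches the label."""
    hits = 0
    for row, label in zip(probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == label)
    return hits / len(labels)


# Made-up predictions for three trials and three candidate songs.
labels = [0, 1, 2]
acoustic = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.3, 0.3]]
expectation = [[0.3, 0.4, 0.3], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

combined = average_probs([acoustic, expectation])
acc_acoustic = accuracy(acoustic, labels)        # 2 of 3 trials correct
acc_expectation = accuracy(expectation, labels)  # 2 of 3 trials correct
acc_combined = accuracy(combined, labels)        # 3 of 3 trials correct
```

The two members err on different trials, so averaging recovers both mistakes. Members that make the same errors, as seed variants of one model often do, would gain less from the same operation.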

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pretraining on distinct representation types could be tested for decoding other cognitive processes that involve prediction, such as speech from EEG.
  • The method may support creation of more general-purpose EEG models that scale to larger and more varied stimulus sets.
  • Future experiments could check whether these gains hold when the music stimuli or listener groups differ substantially from the training data.
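One way to probe the last point is a leave-one-subject-out split, where every listener is held out in turn. The helper below is a hypothetical sketch; the subject labels are placeholders and nothing here reflects the paper's actual evaluation protocol.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject at a time."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Placeholder trial-to-subject assignment.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s3"]
splits = list(leave_one_subject_out(subject_ids))
```

The same function works for held-out stimuli if the list carries song identifiers instead of subject identifiers.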

Load-bearing premise

The acoustic and expectation ANN representations distinctly and accurately reflect separable components of cortical activity during music listening.

What would settle it

An experiment showing that pretraining on unrelated or randomly generated targets produces comparable gains in EEG music identification accuracy would indicate the improvements stem from general pretraining rather than the specific content of these representations.
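Such control targets are simple to construct. The sketch below shows two hypothetical variants under stated assumptions: permuting whole target rows across trials, which keeps every vector intact but breaks its pairing with the EEG, and drawing Gaussian noise matched to each dimension's mean and standard deviation. The function names and the synthetic `targets` array are illustrative only.

```python
import random
import statistics


def shuffle_across_trials(targets, seed=0):
    """Break the EEG-to-target pairing by permuting whole target rows."""
    shuffled = [row[:] for row in targets]
    random.Random(seed).shuffle(shuffled)
    return shuffled


def gaussian_like(targets, seed=0):
    """Random targets matching each dimension's mean and standard deviation."""
    rng = random.Random(seed)
    columns = list(zip(*targets))
    stats = [(statistics.mean(c), statistics.pstdev(c)) for c in columns]
    return [[rng.gauss(m, s) for m, s in stats] for _ in targets]


# Synthetic stand-in for a teacher representation: 20 trials, 8 dimensions.
data_rng = random.Random(42)
targets = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

shuffled = shuffle_across_trials(targets)
noise = gaussian_like(targets)
```

Both controls keep the dimensionality of the original targets, so the pretraining procedure can stay identical and only the information content of the teacher changes.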

Original abstract

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that pretraining EEG decoding models to predict acoustic and expectation-related representations extracted from ANNs improves music identification performance from brain activity. Models pretrained on either target outperform non-pretrained baselines, and their combination produces complementary gains that exceed those obtained from strong ensembles created by varying random initializations. The expectation representation is derived directly from raw audio signals without manual labels and is argued to capture predictive structure beyond onsets or pitch, supporting scalable, label-free investigation of multilayer predictive encoding in music cognition.

Significance. If the central empirical claims hold after appropriate controls, the work provides evidence that specific ANN-derived representations aligned with hypothesized cortical components can serve as effective teacher signals for EEG representation learning. This could support development of general-purpose neural decoding models grounded in cortical encoding principles and advance predictive models of music cognition. The label-free computation of the expectation representation and its scalability to diverse datasets are potential strengths for broader applicability.

major comments (1)
  1. Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.
minor comments (3)
  1. Methods section: Additional details are needed on EEG dataset size, number of subjects, preprocessing steps, exact ANN architectures used for target extraction, and the downstream EEG model architecture to allow full reproducibility and assessment of post-hoc choices.
  2. Figures and results: Performance plots and tables should include error bars, standard deviations, or statistical significance tests to quantify variability across seeds and subjects, as the abstract reports improvements without visible uncertainty measures.
  3. Abstract and introduction: Clarify the precise definition and computation of the 'expectation representation' early on, including how it differs from basic acoustic features like onset or pitch, to strengthen the claim of capturing multilayer predictive encoding.
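The uncertainty reporting requested in minor comment 2 could look like the following sketch. It assumes per-seed and per-subject accuracies are available; all numbers are made-up placeholders, not values from the paper. It reports mean and sample standard deviation across seeds and runs an exact two-sided sign-flip permutation test on paired per-subject differences, one of several reasonable choices of test.

```python
import itertools
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. across random seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total


# Placeholder accuracies across five seeds of one model.
seed_acc = [0.60, 0.62, 0.61, 0.63, 0.59]
seed_mean, seed_std = mean_and_std(seed_acc)

# Placeholder per-subject accuracies for two models on the same six subjects.
pretrained = [0.62, 0.58, 0.66, 0.61, 0.64, 0.60]
baseline = [0.55, 0.54, 0.60, 0.57, 0.59, 0.56]
p_value = paired_sign_flip_p(pretrained, baseline)
```

The exact test enumerates all sign patterns, so it is only practical for small subject counts; with more subjects a random subsample of sign patterns would be the usual substitute.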

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on our manuscript. We address the concern about control ablations below and will incorporate the suggested experiments in revision.

Point-by-point responses
  1. Referee: Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.

    Authors: We agree that additional controls with matched-dimensionality non-semantic signals would help isolate whether gains derive from specific representational alignment rather than generic auxiliary-task effects. Our existing comparisons to seed ensembles (varying only random initializations while holding targets fixed) already demonstrate that the combined acoustic+expectation model exceeds ensemble performance, providing evidence against purely non-specific regularization or initialization benefits. To directly address the referee's point, we will add ablations in the revised manuscript using shuffled versions of the acoustic and expectation features (preserving dimensionality) as well as unrelated audio statistics, with the pretraining procedure held identical. These will be reported alongside the original results to clarify the role of representation type.

    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical pretraining and evaluation chain is self-contained

full rationale

The paper describes an empirical workflow of pretraining models to predict acoustic and expectation ANN representations as teacher targets, followed by downstream evaluation on EEG music identification performance. Claims rest on reported outperformance versus non-pretrained baselines and seed ensembles, with no equations, fitted parameters renamed as predictions, or self-referential definitions visible in the provided text. Prior-work citations on ANN-cortical resemblance are external and not load-bearing for the central experimental result. The derivation does not reduce to inputs by construction and remains falsifiable via the described ablations and comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so specific free parameters, axioms, or invented entities cannot be extracted; the work implicitly relies on standard assumptions of ANN representational similarity to cortex and supervised pretraining efficacy.

pith-pipeline@v0.9.0 · 5688 in / 1056 out tokens · 43046 ms · 2026-05-21T11:32:00.882454+00:00 · methodology



Supplementary information

Supplementary Note 1: Algorithmic details of MuQ extraction

In Algorithm 1, the input waveform is transformed explicitly as x → x1 → x2 → x3. In Algorithm 2, K-means is run with K-means...