Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Pith reviewed 2026-05-21 11:32 UTC · model grok-4.3
The pith
Pretraining EEG models on acoustic and expectation ANN representations improves music identification from brain activity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models pretrained to predict either acoustic or expectation ANN representations outperform non-pretrained baselines, and combining them yields complementary gains that exceed those of strong seed ensembles formed by varying random initializations. The expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch. This shows that the type of teacher representation shapes downstream performance and that representation learning can be guided by neural encoding.
What carries the argument
Distinct acoustic and expectation ANN representations used as separate teacher targets to pretrain EEG decoding models for music identification.
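The teacher-target idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the encoder is a single linear map, the "EEG" features and teacher embeddings are synthetic, and the names `matvec`, `mse`, and `pretrain` are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mse(W, X, T):
    """Mean squared error between encoder outputs W @ x and teacher targets t."""
    total = 0.0
    for x, t in zip(X, T):
        y = matvec(W, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / (len(X) * len(T[0]))

def pretrain(X, T, lr=0.02, epochs=50, seed=0):
    """Fit a linear encoder to teacher embeddings with plain per-sample SGD."""
    rng = random.Random(seed)
    d_in, d_out = len(X[0]), len(T[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = matvec(W, x)
            for i in range(d_out):
                err = y[i] - t[i]
                for j in range(d_in):
                    W[i][j] -= lr * err * x[j]  # gradient of 0.5 * err**2
    return W

# Synthetic stand-ins (assumption): 64 "EEG" feature vectors of size 8 and
# 4-dimensional teacher embeddings that depend linearly on them plus noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]
W_true = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
T = [[v + rng.gauss(0, 0.05) for v in matvec(W_true, x)] for x in X]

W_init = pretrain(X, T, epochs=0)   # same seed, so this is the starting point
W_fit = pretrain(X, T, epochs=50)
loss_before = mse(W_init, X, T)
loss_after = mse(W_fit, X, T)
```

In a setup like the one described, the pretrained weights would then initialize the music-identification model, with one such pretraining run per teacher (acoustic or expectation).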
If this is right
- Pretrained models on either representation outperform non-pretrained baselines.
- Combining acoustic and expectation targets produces gains beyond those from random seed ensembles.
- Teacher representation type shapes downstream EEG decoding performance.
- Representation learning for brain signals can be guided by neural encoding principles.
- The expectation representation enables investigation of multilayer predictive encoding without manual labels.
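The comparison between a combined acoustic+expectation model and a seed ensemble can be pictured as the same averaging step applied to different sets of models. The sketch below assumes probability averaging as the combination rule, which the summary does not specify; the logits are made up for illustration and `ensemble_predict` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_predict(logit_sets):
    """Average class probabilities over models; return (argmax, averaged probs)."""
    probs = [softmax(logits) for logits in logit_sets]
    n = len(probs)
    avg = [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]
    return max(range(len(avg)), key=avg.__getitem__), avg

# Made-up logits over 3 songs for one EEG segment (not results from the paper).
acoustic_logits = [2.0, 1.9, 0.1]     # narrowly prefers song 0
expectation_logits = [0.2, 2.5, 0.3]  # clearly prefers song 1
pred, avg = ensemble_predict([acoustic_logits, expectation_logits])
```

A seed ensemble would pass logits from models that share one teacher but differ in random initialization through the same function, so the two conditions differ only in which models are averaged.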
Where Pith is reading between the lines
- Similar pretraining on distinct representation types could be tested for decoding other cognitive processes that involve prediction, such as speech from EEG.
- The method may support creation of more general-purpose EEG models that scale to larger and more varied stimulus sets.
- Future experiments could check whether these gains hold when the music stimuli or listener groups differ substantially from the training data.
Load-bearing premise
The acoustic and expectation ANN representations distinctly and accurately reflect separable components of cortical activity during music listening.
What would settle it
An experiment showing that pretraining on unrelated or randomly generated targets produces comparable gains in EEG music identification accuracy would indicate the improvements stem from general pretraining rather than the specific content of these representations.
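Control targets for such an experiment could be generated as follows. This is a sketch under assumptions: teacher targets are stored as a list of per-frame vectors, and the helper names `shuffle_time` and `gaussian_like` are hypothetical. Both controls keep the shape of the original targets so the pretraining procedure can stay identical.

```python
import random
import statistics

def shuffle_time(targets, seed=0):
    """Permute frames so each target no longer lines up with its EEG frame.

    The returned list is new, but it holds the same row objects as the input.
    """
    rng = random.Random(seed)
    shuffled = list(targets)
    rng.shuffle(shuffled)
    return shuffled

def gaussian_like(targets, seed=0):
    """Draw random targets with the same shape and per-dimension mean/std."""
    rng = random.Random(seed)
    dims = list(zip(*targets))  # one tuple of values per feature dimension
    mus = [statistics.fmean(d) for d in dims]
    sds = [statistics.pstdev(d) for d in dims]
    return [[rng.gauss(m, s) for m, s in zip(mus, sds)] for _ in targets]

# Toy teacher targets (assumption): 6 frames, 2 feature dimensions.
teacher = [[float(t), float(t) ** 2] for t in range(6)]
shuffled = shuffle_time(teacher, seed=1)
noise = gaussian_like(teacher, seed=1)
```

If pretraining on `shuffled` or `noise` matched the gains from the real targets, that would point to generic pretraining effects rather than the content of the representations.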
Original abstract
During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that pretraining EEG decoding models to predict acoustic and expectation-related representations extracted from ANNs improves music identification performance from brain activity. Pretrained models on either target outperform non-pretrained baselines, and their combination produces complementary gains that exceed those obtained from strong ensembles created by varying random initializations. The expectation representation is derived directly from raw audio signals without manual labels and is argued to capture predictive structure beyond onsets or pitch, supporting scalable, label-free investigation of multilayer predictive encoding in music cognition.
Significance. If the central empirical claims hold after appropriate controls, the work provides evidence that specific ANN-derived representations aligned with hypothesized cortical components can serve as effective teacher signals for EEG representation learning. This could support development of general-purpose neural decoding models grounded in cortical encoding principles and advance predictive models of music cognition. The label-free computation of the expectation representation and its scalability to diverse datasets are potential strengths for broader applicability.
major comments (1)
- Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.
minor comments (3)
- Methods section: Additional details are needed on EEG dataset size, number of subjects, preprocessing steps, exact ANN architectures used for target extraction, and the downstream EEG model architecture to allow full reproducibility and assessment of post-hoc choices.
- Figures and results: Performance plots and tables should include error bars, standard deviations, or statistical significance tests to quantify variability across seeds and subjects, as the abstract reports improvements without visible uncertainty measures.
- Abstract and introduction: Clarify the precise definition and computation of the 'expectation representation' early on, including how it differs from basic acoustic features like onset or pitch, to strengthen the claim of capturing multilayer predictive encoding.
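The uncertainty reporting requested in the second minor comment could look roughly like this. It is a sketch only: the per-subject accuracies are placeholders invented for illustration (not values from the paper), the paired sign-flip permutation test is one reasonable choice among several, and `paired_sign_flip_test` is a hypothetical helper.

```python
import random
import statistics

def paired_sign_flip_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on mean(a - b) via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Placeholder per-subject accuracies (assumption, 8 subjects).
baseline = [0.52, 0.55, 0.49, 0.61, 0.58, 0.50, 0.57, 0.54]
pretrained = [0.58, 0.60, 0.56, 0.66, 0.61, 0.57, 0.63, 0.59]

summary = {
    name: (statistics.fmean(vals), statistics.stdev(vals))
    for name, vals in [("baseline", baseline), ("pretrained", pretrained)]
}
p_value = paired_sign_flip_test(pretrained, baseline)
```

Reporting the mean and standard deviation per condition alongside a paired test keeps subject-level variability visible instead of folding it into a single accuracy figure.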
Simulated Author's Rebuttal
We thank the referee for the constructive comment on our manuscript. We address the concern about control ablations below and will incorporate the suggested experiments in revision.
Point-by-point responses
- Referee: Results and experimental evaluation sections: The reported performance improvements and complementary gains from acoustic vs. expectation pretraining targets lack control ablations using matched-dimensionality but non-semantic signals (e.g., shuffled features or unrelated audio statistics) while keeping the pretraining procedure identical. Without these, it remains unclear whether gains arise from specific alignment with separable cortical components or from general auxiliary-task regularization and better initialization, which directly bears on the central claim that representation type shapes downstream performance via neural encoding principles.
  Authors: We agree that additional controls with matched-dimensionality non-semantic signals would help isolate whether gains derive from specific representational alignment rather than generic auxiliary-task effects. Our existing comparisons to seed ensembles (varying only random initializations while holding targets fixed) already demonstrate that the combined acoustic+expectation model exceeds ensemble performance, providing evidence against purely non-specific regularization or initialization benefits. To directly address the referee's point, we will add ablations in the revised manuscript using shuffled versions of the acoustic and expectation features (preserving dimensionality) as well as unrelated audio statistics, with the pretraining procedure held identical. These will be reported alongside the original results to clarify the role of representation type.
  Revision: yes
Circularity Check
No circularity; empirical pretraining and evaluation chain is self-contained
Full rationale
The paper describes an empirical workflow of pretraining models to predict acoustic and expectation ANN representations as teacher targets, followed by downstream evaluation on EEG music identification performance. Claims rest on reported outperformance versus non-pretrained baselines and seed ensembles, with no equations, fitted parameters renamed as predictions, or self-referential definitions visible in the provided text. Prior-work citations on ANN-cortical resemblance are external and not load-bearing for the central experimental result. The derivation does not reduce to inputs by construction and remains falsifiable via the described ablations and comparisons.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Supplementary information
Supplementary Note 1: Algorithmic details of MuQ extraction
In Algorithm 1, the input waveform is transformed explicitly as x → x1 → x2 → x3. In Algorithm 2, K-means is run with K-means...