
arxiv: 2605.16181 · v1 · pith:GS2EE6XUnew · submitted 2026-05-15 · 💻 cs.SD

ARIA: A Diagnostic Framework for Music Training Data Attribution

Pith reviewed 2026-05-19 18:20 UTC · model grok-4.3

classification 💻 cs.SD
keywords training data attribution · music generation · copyright analysis · reliability diagnostics · symbolic music · audio music · influence decomposition · score matrix analysis

The pith

ARIA decomposes training data attribution for music generation into specific musical aspects and evaluates attribution methods with reliability diagnostics whose ranking of four methods matches the ground-truth ranking from counterfactual retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing attribution methods for music generation models give only a single number for how much a training song influenced an output. ARIA instead breaks that influence down along distinct musical aspects such as melody or harmony for symbolic music and timbre or rhythm for audio. It then runs three reliability checks on the resulting score matrix: how similar the top attributed tracks are to each other, the structure revealed by singular value decomposition, and basic column statistics. When tested on a symbolic model whose true influences are known from retraining without certain songs, these checks rank four different attribution methods in exactly the same order as the ground-truth retraining does. The result gives per-aspect evidence that aligns with the idea-expression distinction used in copyright analysis.
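
As a rough illustration of the first check, the sketch below computes a within-group homogeneity z-score for one musical aspect: the mean pairwise similarity among the top-K attributed tracks, standardized against random groups of the same size drawn from the training pool. It is not the paper's implementation. The use of cosine similarity, the number of reference groups (`n_ref`), the toy data, and every function and variable name are assumptions made here for illustration.

```python
import numpy as np

def mean_pairwise_cosine(features: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    k = features.shape[0]
    # Average the off-diagonal entries only (the diagonal is self-similarity).
    return float((sim.sum() - np.trace(sim)) / (k * (k - 1)))

def homogeneity_z(aspect_features, top_k_idx, n_ref=200, seed=0) -> float:
    """Z-score of top-K similarity against random same-size reference groups."""
    rng = np.random.default_rng(seed)
    n, k = len(aspect_features), len(top_k_idx)
    observed = mean_pairwise_cosine(aspect_features[top_k_idx])
    reference = np.array([
        mean_pairwise_cosine(aspect_features[rng.choice(n, size=k, replace=False)])
        for _ in range(n_ref)
    ])
    return float((observed - reference.mean()) / (reference.std() + 1e-12))

# Toy pool (hypothetical): 300 tracks with 8-dimensional features for one aspect.
rng = np.random.default_rng(1)
pool = rng.normal(size=(300, 8))
# Make the first 10 tracks nearly identical along this aspect.
shared = 5.0 * rng.normal(size=8)
pool[:10] = shared + 0.1 * rng.normal(size=(10, 8))

z_top = homogeneity_z(pool, np.arange(10))
print(f"z-score of the homogeneous group: {z_top:.1f}")
```

A large positive z-score on one aspect and a near-zero score on another would suggest that the attributed tracks resemble each other mainly along the first aspect.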

Core claim

The paper claims that pairing aspect-decomposed attribution scores with reliability diagnostics computed from the segment-level score matrix produces a diagnostic framework that ranks attribution methods identically to ground-truth rankings obtained by counterfactual retraining on a symbolic-music model, while also exposing substantial differences in behavior across methods on an audio generation model and characterizing embedding baselines by the musical aspect each encoder emphasizes.

What carries the argument

The argument rests on the ARIA framework, which decomposes attribution scores along five musical aspects for symbolic music or three for audio and applies three reliability diagnostics: within-group similarity of the top-K tracks, singular value decomposition of the score matrix, and column statistics.
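
The matrix-level diagnostics can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's code: it assumes the score matrix has one row per query and one column per training segment, and it assumes that the rank-1 share is the fraction of squared singular values carried by the first component. The names `svd_diagnostics`, `column_stats`, `r1`, and `r2_5` are chosen here for illustration.

```python
import numpy as np

def svd_diagnostics(S: np.ndarray):
    """Return the rank-1 energy share, the share of components 2-5,
    and the residual after removing the rank-1 component."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    energy = s**2 / np.sum(s**2)
    r1 = float(energy[0])
    r2_5 = float(energy[1:5].sum())
    residual = S - s[0] * np.outer(U[:, 0], Vt[0])
    return r1, r2_5, residual

def column_stats(S: np.ndarray):
    """Per-column mean and standard deviation across queries."""
    return S.mean(axis=0), S.std(axis=0)

# Toy degenerate case (hypothetical): every query assigns the same scores,
# so the matrix is exactly rank 1.
bias = np.arange(1.0, 9.0)
S_degenerate = np.tile(bias, (6, 1))  # 6 queries x 8 training segments

r1, r2_5, residual = svd_diagnostics(S_degenerate)
col_mean, col_std = column_stats(S_degenerate)
print(f"r1={r1:.3f}, r2:5={r2_5:.3f}, max |residual|={np.abs(residual).max():.2e}")
```

In this toy case the first component carries all of the energy and every column has zero spread across queries, which is the pattern one would expect from a score matrix that ignores the query.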

If this is right

  • Attribution reports can list influence separately for each musical aspect instead of a single scalar.
  • Reliability diagnostics can serve as an objective way to compare new attribution methods against existing ones.
  • Score matrices that return nearly identical tracks for every query can be flagged as failing to reflect query-specific influence.
  • Embedding-similarity baselines can be characterized by which musical aspect their encoder tends to surface.
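
One way to picture the flag for query-independent retrieval is to measure how much the top-K sets overlap across queries. The sketch below uses mean pairwise Jaccard overlap, which is a stand-in chosen here and not necessarily the statistic the paper uses; the function names and toy matrices are likewise hypothetical.

```python
from itertools import combinations

import numpy as np

def top_k_sets(S: np.ndarray, k: int):
    """Indices of the k highest-scoring training tracks for each query (row)."""
    idx = np.argsort(-S, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def mean_jaccard(sets) -> float:
    """Mean Jaccard overlap over all distinct pairs of top-K sets."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Degenerate matrix: every query sees the same scores.
S_same = np.tile(np.linspace(0.0, 1.0, 50), (5, 1))

# Query-specific matrix: each query scores its own block of 10 tracks highest.
S_specific = np.zeros((5, 50))
for q in range(5):
    S_specific[q, q * 10:(q + 1) * 10] = 1.0

overlap_same = mean_jaccard(top_k_sets(S_same, k=10))
overlap_specific = mean_jaccard(top_k_sets(S_specific, k=10))
print(overlap_same, overlap_specific)
```

An overlap near 1 means nearly the same tracks are returned for every query; where to set a flagging threshold is a judgment call that this sketch does not settle.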

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use the per-aspect breakdowns to audit training data for specific stylistic borrowings before release.
  • The same diagnostic structure might be adapted to other generative domains once domain-specific aspects are defined.
  • Courts or rights holders could request aspect-level attribution reports when assessing whether a generated work copies protectable expression.

Load-bearing premise

The chosen musical aspects and the three reliability diagnostics together capture the dimensions of influence that matter for both model behavior and copyright analysis.

What would settle it

On the symbolic-music model, if the reliability diagnostics ranked the four attribution methods in an order different from the order produced by counterfactual retraining without each song, the central claim would be falsified.
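
The falsification test amounts to comparing two orderings of the same four methods. A minimal sketch follows, using Kendall's tau (the tau-a variant, without tie correction) as one reasonable agreement measure; the paper may report agreement differently. The method names and all scores are placeholders invented for illustration, not results from the paper.

```python
from itertools import combinations

def ranking(scores: dict) -> list:
    """Method names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(scores_x: dict, scores_y: dict) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    names = list(scores_x)
    concordant = discordant = 0
    for a, b in combinations(names, 2):
        sign = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical diagnostic scores and retraining-based ground-truth scores.
diagnostic = {"method_a": 0.9, "method_b": 0.4, "method_c": 0.7, "method_d": 0.1}
ground_truth = {"method_a": 0.62, "method_b": 0.20, "method_c": 0.35, "method_d": 0.05}
# Same ground truth with method_b and method_c swapped.
ground_truth_swapped = {"method_a": 0.62, "method_b": 0.35, "method_c": 0.20, "method_d": 0.05}

tau_match = kendall_tau(diagnostic, ground_truth)
tau_swapped = kendall_tau(diagnostic, ground_truth_swapped)
print(ranking(diagnostic), tau_match, round(tau_swapped, 3))
```

Identical orderings give a tau of 1; any value below 1 on the symbolic-music model would correspond to the outcome described above.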

Figures

Figures reproduced from arXiv: 2605.16181 by Ashkan Panahi, Changheon Han, Kıvanç Tatar.

Figure 1: Homogeneity $\bar{z}_c$ across K for GRAD-COS and LOGRA at coarse and fine, contrasting the largest and smallest $r_1$ regimes among non-semantic attribution settings. Dashed curves use the original $S^{\mathrm{seg}}$; solid curves use the rank-1 residual. COS coarse is the clearest case ($r_1 = 1.000$, $r_{2:5} = 0.000$), producing $\bar{z}_c = +29.56$ on timbral and $+8.56$ on harmonic, both the largest values across all 15 settings. Rank-1 removal fl…
Figure 2: Within-group homogeneity $\bar{z}_c$ across K on MusicTransformer + MAESTRO, one panel per jSymbolic channel with the four attribution methods overlaid. Dashed curves use the original $S^{\mathrm{seg}}$; solid curves use the rank-1 residual.
Figure 3: Homogeneity $\bar{z}_c$ across K for the eight attribution stage-method settings not shown in …
Figure 4: Homogeneity $\bar{z}_c$ across K for the three embedding-based retrieval baselines. Curve conventions as in …
Original abstract

Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, without revealing which musical aspects are dominant in that influence. We propose ARIA, a framework that decomposes attribution along musical aspects (five for symbolic music, three for audio) and pairs the decomposition with reliability diagnostics computed from the segment-level score matrix. It measures within-group similarity among the top-K attributed tracks against random reference groups drawn from the training pool, and diagnoses the score matrix through its singular value decomposition and column statistics. On a symbolic-music model where attribution ground truth is available through counterfactual retraining, the reliability diagnostics rank four attribution methods identically to that ground truth. On an audio music generation model, ARIA reveals attribution behaviors that vary substantially across TDA methods, flags score matrices whose retrieved tracks are nearly identical across queries rather than reflecting per-query attribution, and characterizes embedding-similarity retrieval baselines by the musical aspect each encoder surfaces. Together, ARIA produces per-aspect attribution evidence aligned with the musical aspects considered under the idea-expression distinction in copyright analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ARIA, a framework for training data attribution (TDA) in music generation models. It decomposes attribution scores along musical aspects (five for symbolic music, three for audio) and augments them with reliability diagnostics computed from the segment-level score matrix, specifically within-group similarity of top-K attributed tracks versus random references, singular value decomposition, and column statistics. On a symbolic-music model, these diagnostics produce the same ranking of four TDA methods as ground truth obtained via counterfactual retraining; on an audio model, ARIA is used to characterize differences across TDA methods and to flag cases where retrieved tracks are nearly identical across queries.

Significance. If the central validation holds, ARIA supplies a needed tool for aspect-specific attribution analysis in generative music, directly relevant to copyright questions under the idea-expression distinction. The explicit use of counterfactual retraining to obtain ground truth and the introduction of matrix-based diagnostics constitute concrete strengths that move beyond scalar TDA scores.

major comments (1)
  1. [Symbolic-music validation experiment, as described in the abstract] The claim that the reliability diagnostics rank the four attribution methods identically to ground truth rests on counterfactual retraining, yet the manuscript reports no variance across random seeds, no stability checks on convergence, and no quantification of how much of the output difference arises from optimizer noise rather than track removal. Because neural training is stochastic, the measured ground-truth ranking may itself be confounded, which weakens the assertion that the diagnostics recover a reliable ordering.
minor comments (2)
  1. The selection process for the five symbolic and three audio musical aspects, and the precise construction of the segment-level score matrix, are not described with sufficient detail to determine whether they were fixed before seeing results or how they map to dimensions of influence.
  2. Clarify the exact definition of 'within-group similarity' (e.g., distance metric, choice of K, and sampling procedure for random reference groups) so that the diagnostic can be reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the single major comment below and agree that additional analysis of training stochasticity will strengthen the validation section.

  1. Referee: [Symbolic-music validation experiment, as described in the abstract] The claim that the reliability diagnostics rank the four attribution methods identically to ground truth rests on counterfactual retraining, yet the manuscript reports no variance across random seeds, no stability checks on convergence, and no quantification of how much of the output difference arises from optimizer noise rather than track removal. Because neural training is stochastic, the measured ground-truth ranking may itself be confounded, which weakens the assertion that the diagnostics recover a reliable ordering.

    Authors: We agree that neural training stochasticity is a legitimate concern for the counterfactual-retraining ground truth and that the original manuscript does not report variance across seeds or quantify optimizer noise versus track-removal effects. In the experiments we performed, a single fixed seed was used for reproducibility, and the ground-truth ranking of the four TDA methods matched the diagnostic ranking across the reported runs. To strengthen the claim, we will add a new subsection that repeats the counterfactual retraining for three independent random seeds, reports the resulting variance in the ground-truth ranking, and compares the magnitude of output changes attributable to seed variation with those attributable to track removal. This revision will make the validation more robust without altering the core findings.
    Revision: yes
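
The comparison the authors promise, seed variation against track removal, could be summarized along the following lines. This is only a sketch of one possible analysis: the choice of a per-run scalar metric (for example a loss on the generated output), the pooled-standard-deviation ratio, and all numbers are assumptions made here, not values or procedures from the paper.

```python
import numpy as np

def removal_effect_vs_seed_noise(full_runs, ablated_runs):
    """Compare the mean shift caused by removing a track with the
    run-to-run spread caused by changing only the random seed."""
    full = np.asarray(full_runs, dtype=float)
    ablated = np.asarray(ablated_runs, dtype=float)
    effect = ablated.mean() - full.mean()
    # Pooled sample standard deviation across seeds (equal group sizes).
    pooled_std = np.sqrt((full.var(ddof=1) + ablated.var(ddof=1)) / 2)
    return effect, pooled_std, effect / pooled_std

# Hypothetical metric values, one per seed (three seeds each).
metric_full = [2.00, 2.02, 1.98]      # trained on all tracks
metric_ablated = [2.30, 2.33, 2.27]   # retrained without one track

effect, noise, ratio = removal_effect_vs_seed_noise(metric_full, metric_ablated)
print(f"effect={effect:.3f}, seed noise={noise:.4f}, ratio={ratio:.1f}")
```

A ratio well above 1 would indicate that the removal effect dominates seed noise for that track; a ratio near or below 1 would mean the ground-truth signal for that track is hard to separate from optimizer stochasticity.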

Circularity Check

0 steps flagged

No significant circularity detected


The paper proposes ARIA to decompose attribution scores along musical aspects and apply independent reliability diagnostics (within-group similarity to random groups, SVD of the score matrix, and column statistics). The key empirical claim is that these diagnostics produce the same ranking of four TDA methods as an external ground truth obtained by counterfactual retraining on a symbolic model. No equations, fitted parameters, or self-citations are described that would make the diagnostics or rankings equivalent to the inputs by construction. The validation relies on an independent retraining procedure rather than reducing to a definitional or fitted equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that musical influence can be decomposed into a small number of discrete aspects and that the proposed diagnostics are valid proxies for attribution quality; no free parameters or invented entities are mentioned.

axioms (1)
  • Domain assumption: musical influence operates along a small set of discrete, identifiable aspects that align with copyright-relevant distinctions.
    The framework defines five aspects for symbolic music and three for audio without further justification in the abstract.



