arxiv: 2509.12712 · v5 · submitted 2025-09-16 · 💻 cs.SD · cs.IR

A Lightweight Two-Branch Architecture for Multi-Instrument Transcription via Note-Level Contrastive Clustering

Pith reviewed 2026-05-18 16:42 UTC · model grok-4.3

classification 💻 cs.SD cs.IR
keywords multi-instrument transcription · contrastive clustering · lightweight neural network · timbre separation · note-level clustering · source separation · music information retrieval

The pith

A lightweight two-branch model transcribes and separates arbitrary instruments by clustering notes according to timbre.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing multi-instrument transcription systems often cannot handle new sounds, demand a fixed number of sources, or require too much computing power for everyday devices. This paper builds a compact network that first identifies notes without regard to which instrument produced them, then feeds those notes into a separate timbre encoder. Deep clustering at the note level groups the notes into as many instruments as the user specifies at run time. The resulting system stays small enough for fast inference on ordinary hardware while, according to the paper, remaining competitive with much larger models on both transcription and separation tasks.

Core claim

The model extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations including spectral normalization, dilated convolutions, and contrastive clustering further improve efficiency and robustness. Despite its small size and fast inference, the model achieves competitive performance with heavier baselines in terms of transcription accuracy and separation quality, and shows promising generalization ability.

What carries the argument

Note-level contrastive clustering, which groups individual transcribed notes by the similarity of the timbre features that the dedicated encoder extracts for them, and thereby separates instruments dynamically.
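To make the grouping step concrete, here is a minimal sketch of inference-time note clustering. It assumes each note comes with a list of frame-level timbre embeddings, mean-pools them into one unit vector per note, and groups the notes with spherical k-means using the user-supplied count K. The helper names (`note_embedding`, `cluster_notes`), the farthest-point initialisation, and the use of k-means itself are illustrative assumptions; the page does not show the paper's actual clustering procedure or its contrastive training loss.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def note_embedding(frame_embeddings):
    """Mean-pool the frame-level embeddings of one note, then L2-normalize."""
    n = len(frame_embeddings)
    mean = [sum(col) / n for col in zip(*frame_embeddings)]
    return l2_normalize(mean)

def cosine(a, b):
    # Inputs are unit vectors, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def cluster_notes(note_vecs, k, iters=20):
    """Spherical k-means with deterministic farthest-point initialisation."""
    centroids = [note_vecs[0]]
    while len(centroids) < k:
        # Pick the note least similar to its closest existing centroid.
        far = min(note_vecs, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(note_vecs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in note_vecs]
        for j in range(k):
            members = [v for v, lab in zip(note_vecs, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = l2_normalize(mean)
    return labels

# Toy data (hypothetical): four notes with 2-D frame embeddings, two timbres.
notes = [
    [[1.0, 0.1], [0.9, 0.0]],
    [[0.1, 1.0], [0.0, 0.8]],
    [[0.8, 0.2]],
    [[0.2, 0.9]],
]
vecs = [note_embedding(frames) for frames in notes]
labels = cluster_notes(vecs, k=2)
print(labels)  # notes 0 and 2 share one label, notes 1 and 3 the other
```

Pooling before clustering means every frame of a note receives the same instrument label, which is the practical difference between note-level and frame-level assignment.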

If this is right

  • The model can transcribe and separate instruments it has never encountered during training once the user states the number of classes present.
  • Only the count of instruments is needed at inference rather than fixed source identities or pre-assigned labels.
  • Spectral normalization and dilated convolutions reduce model size and speed up inference while preserving separation quality.
  • The compact design supports practical deployment on resource-limited hardware without sacrificing competitive accuracy.
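The size argument for dilated convolutions can be illustrated with a small arithmetic sketch. The Figure 2 excerpt on this page mentions a BasicPitch kernel spanning 39 frequency bins; the snippet below compares a dense 39-tap kernel with a dilated kernel covering the same span. The 3-tap kernel, the dilation of 19, and the 16 channels are hypothetical values chosen only so that the arithmetic matches the 39-bin span; they are not taken from the paper.

```python
def kernel_span(taps, dilation):
    """Number of input bins covered by a 1-D kernel with `taps` weights."""
    return (taps - 1) * dilation + 1

def conv_weights(taps, c_in, c_out):
    """Weight count of a 1-D convolution along frequency (bias excluded)."""
    return taps * c_in * c_out

dense = conv_weights(taps=39, c_in=16, c_out=16)    # one weight per bin
dilated = conv_weights(taps=3, c_in=16, c_out=16)   # hypothetical 3-tap kernel

print(kernel_span(39, 1), kernel_span(3, 19))  # 39 39
print(dense, dilated)                          # 9984 768
```

The dilated kernel reaches the same distance along the frequency axis with far fewer weights, at the cost of sampling the span sparsely rather than at every bin.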

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same note-level clustering idea could extend to separating overlapping notes within a single instrument or to non-musical audio mixtures such as speech in noise.
  • Mobile or embedded applications for live music analysis become feasible because the network runs quickly and adapts to any instrument set.
  • Future versions might remove the need to specify the instrument count in advance by adding an automatic source-count estimator.

Load-bearing premise

The approach assumes that providing the number of instrument classes at inference time combined with note-level contrastive clustering will reliably separate arbitrary unseen timbres without the rigid source-count constraints or pre-training limitations of prior models.

What would settle it

A controlled test on music containing three or more instruments never seen during training, where the model receives the correct instrument count yet produces note assignments with substantially higher error rates than larger pre-trained baselines, would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2509.12712 by Ruigang Li, Yongxu Zhu.

Figure 1: Overview of the proposed method. Top: Overall pipeline, which takes a multi-timbral mixture audio as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre-agnostic transcription outputs (frame activation posteriorgram YF and onset activation posteriorgram YO). Bottom right: Timbre encoding branch yielding a D-dimensional timbre embedding Vt i for each time–freq…
Figure 2: Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.
$\tilde{Q} = k \cdot (\log |Q|^2 - \log \sigma) + b$ (4), where k and b are learnable scalars.
3.4 Dilated Convolution for Harmonic Context
In BasicPitch, a large convolutional kernel spanning 39 frequency bins (one octave plus one note) is explicitly mentioned as effective for mitig…
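For readers unfamiliar with the loss named in the Figure 2 caption, the following is a minimal sketch of binary focal loss in the form given by Lin et al. (2017), with a multiplier on the positive class. The `pos_weight` parameter name and the choice to leave the negative class unweighted are assumptions for illustration; the page does not show how the paper parameterizes its positive class weight.

```python
import math

def focal_loss(p, y, gamma=2.0, pos_weight=1.0, eps=1e-7):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    if y == 1:
        # Confident positives (p near 1) are down-weighted by (1 - p)^gamma.
        return -pos_weight * (1.0 - p) ** gamma * math.log(p)
    # Confident negatives (p near 0) are down-weighted by p^gamma.
    return -(p ** gamma) * math.log(1.0 - p)

easy = focal_loss(0.9, 1)   # well-classified active bin: small loss
hard = focal_loss(0.2, 1)   # missed active bin: much larger loss
print(easy < hard)          # True
```

With `gamma=0` and `pos_weight=1` the expression reduces to ordinary binary cross-entropy, which makes the two knobs easy to check in isolation.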
Figure 3: Result of frame-level and note-level post-processing for triple separation.
3.7 Post-process of timbre-separated transcription
Existing deep clustering–based source separation methods are predominantly designed for spectrogram reconstruction. In the field of music transcription, however, there is a lack of discussion on methods for creating specific notes from clustering results. Tanaka et al. (2020) propo…
Figure 4: Randomly generated piano-roll (left) and the corresponding CQT spectrogram of the synthesized audio using a trumpet timbre (right).
… the onset threshold to maximize note-level F1 (FN). A note is correct if its pitch matches ground truth and its onset is within ±50 ms of the reference. For timbre-separated transcription, we extract note events following section 3.7 using the optimized thresholds. The number…
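The correctness rule quoted in the Figure 4 excerpt (same pitch, onset within ±50 ms) can be turned into a short note-level F1 sketch. The function below uses a greedy one-to-one matching, which is a simplification assumed here for brevity; evaluation toolkits generally use an optimal matching, so reported numbers should come from an established library rather than this sketch. The `(onset_seconds, midi_pitch)` tuple layout is also an assumption.

```python
def note_f1(ref_notes, est_notes, onset_tol=0.05):
    """Note-level F1 for lists of (onset_seconds, midi_pitch) tuples."""
    unmatched = list(ref_notes)
    hits = 0
    for onset, pitch in sorted(est_notes):
        # Reference notes with the same pitch whose onset is close enough.
        candidates = [r for r in unmatched
                      if r[1] == pitch and abs(r[0] - onset) <= onset_tol]
        if candidates:
            best = min(candidates, key=lambda r: abs(r[0] - onset))
            unmatched.remove(best)  # each reference note matches at most once
            hits += 1
    precision = hits / len(est_notes) if est_notes else 0.0
    recall = hits / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference = [(0.00, 60), (0.50, 64), (1.00, 67)]
estimate = [(0.02, 60), (0.58, 64), (1.00, 67), (1.50, 72)]
# Two hits (the 0.58 s onset is 80 ms late; pitch 72 is spurious):
# precision 2/4, recall 2/3, F1 = 4/7.
print(round(note_f1(reference, estimate), 4))  # 0.5714
```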
Figure 5: t-SNE visualization of timbre embeddings. (a) Frame-level embeddings for BACH10 Piece 2; (b) note-level aggregates of (a); (c) frame-level for URMP Piece 18; (d) frame-level for URMP Piece 18 using top-k attention.
… compact and separable timbre clusters. Second, although synthesizers offer diverse instrument programs, each produces a static, idealized timbre. Real-world performances exhibit rich timbral vari…

Original abstract

Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments, rigid source-count constraints, and high computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations including spectral normalization, dilated convolutions, and contrastive clustering further improve efficiency and robustness. Despite its small size and fast inference, the model achieves competitive performance with heavier baselines in terms of transcription accuracy and separation quality, and shows promising generalization ability, making it highly suitable for real-world deployment in practical and resource-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a lightweight two-branch neural network for multi-instrument music transcription. It augments a timbre-agnostic transcription backbone with a dedicated timbre encoder and applies note-level contrastive clustering to jointly transcribe notes and separate an arbitrary number of instruments when the instrument class count K is supplied at inference time. Practical optimizations such as spectral normalization, dilated convolutions, and contrastive clustering are introduced to improve efficiency and robustness. The central claim is that the resulting small, fast model achieves competitive transcription accuracy and separation quality relative to heavier baselines while exhibiting promising generalization to unseen timbres, making it suitable for resource-constrained real-world deployment.

Significance. If the performance and generalization claims are substantiated by rigorous experiments, the work would be significant for practical multi-instrument transcription on low-resource devices. The note-level contrastive clustering approach for dynamic separation without rigid source-count pre-training constraints offers a potentially useful direction for handling variable and unseen instrument timbres.

major comments (3)
  1. [Abstract] Abstract: the claim of 'competitive performance with heavier baselines in terms of transcription accuracy and separation quality' is asserted without any quantitative results, baselines, error bars, dataset details, or statistical tests, leaving the central empirical claim unsupported by visible evidence.
  2. [Experimental Evaluation] Experimental Evaluation: no explicit out-of-distribution instrument test set is described to support the generalization claim for arbitrary unseen timbres. Without such a test, the separation-quality results remain vulnerable to in-distribution leakage and do not directly validate the note-level contrastive clustering behavior on OOD instruments.
  3. [Ablation Studies] Ablation Studies: no ablation that removes the contrastive clustering term is reported. This omission prevents assessment of whether the timbre-discriminative note embeddings and separation quality actually depend on the contrastive objective rather than the backbone alone.
minor comments (2)
  1. [Method] Clarify the precise mechanism by which the supplied K (number of instrument classes) is used to initialize or constrain the note-level clustering at inference time.
  2. [Experiments] Provide model size (parameters), inference latency, and FLOPs comparisons against the heavier baselines to substantiate the 'lightweight' and 'fast inference' claims.
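The latency comparison requested in the second minor comment could be collected with a harness along these lines. This is a minimal sketch: `dummy_model` is a hypothetical stand-in for a forward pass, and the 10-second clip length is an arbitrary assumption. The real-time factor is processing time divided by audio duration, so values below 1 mean faster than real time.

```python
import statistics
import time

def measure_latency(fn, audio_seconds, repeats=5):
    """Return (median_seconds, real_time_factor) over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return median, median / audio_seconds

def dummy_model():
    # Hypothetical stand-in for running a model on one audio clip.
    return sum(i * i for i in range(100_000))

median_s, rtf = measure_latency(dummy_model, audio_seconds=10.0)
print(f"median {median_s * 1e3:.2f} ms, real-time factor {rtf:.5f}")
```

Reporting the median rather than the mean keeps a single slow run, for example a cold cache on the first call, from dominating the figure.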

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the identification of areas where the manuscript can be strengthened and address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'competitive performance with heavier baselines in terms of transcription accuracy and separation quality' is asserted without any quantitative results, baselines, error bars, dataset details, or statistical tests, leaving the central empirical claim unsupported by visible evidence.

    Authors: We agree that the abstract, constrained by length, does not embed specific numbers. The full manuscript reports these results in the Experimental Evaluation section, including F1 scores and SDR values against baselines such as MT3 and other multi-instrument models, with error bars from repeated runs and statistical tests on Slakh2100 and similar datasets. We will revise the abstract to incorporate the key quantitative metrics and dataset references. revision: yes

  2. Referee: [Experimental Evaluation] Experimental Evaluation: no explicit out-of-distribution instrument test set is described to support the generalization claim for arbitrary unseen timbres. Without such a test, the separation-quality results remain vulnerable to in-distribution leakage and do not directly validate the note-level contrastive clustering behavior on OOD instruments.

    Authors: This observation is correct. Our current evaluations use held-out mixtures but do not isolate a dedicated OOD instrument set. We will add an explicit out-of-distribution test set using timbres absent from training and report separation metrics on it to directly assess the note-level contrastive clustering on unseen instruments. revision: yes

  3. Referee: [Ablation Studies] Ablation Studies: no ablation that removes the contrastive clustering term is reported. This omission prevents assessment of whether the timbre-discriminative note embeddings and separation quality actually depend on the contrastive objective rather than the backbone alone.

    Authors: We acknowledge this gap. The manuscript presents the full model but omits an ablation that isolates the contrastive clustering loss. We will include this ablation in the revised version, comparing performance and embedding quality with and without the contrastive term to demonstrate its specific contribution. revision: yes
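The out-of-distribution test set discussed in response 2 could be built with an instrument-disjoint split such as the one sketched below. The function name, the `(track_id, instruments)` layout, and the policy of dropping tracks that mix seen and held-out instruments are all assumptions made for illustration, not a procedure described by the paper or the rebuttal.

```python
def instrument_disjoint_split(tracks, held_out):
    """Split tracks so that held-out instruments never appear in training.

    tracks: list of (track_id, set_of_instrument_names).
    Train keeps tracks with no held-out instrument; the OOD test set keeps
    tracks made up only of held-out instruments; mixed tracks are dropped.
    """
    held_out = set(held_out)
    train, ood_test, dropped = [], [], []
    for track_id, instruments in tracks:
        if not instruments & held_out:
            train.append(track_id)
        elif instruments <= held_out:
            ood_test.append(track_id)
        else:
            dropped.append(track_id)
    return train, ood_test, dropped

tracks = [
    ("t1", {"piano", "violin"}),
    ("t2", {"oboe", "bassoon"}),
    ("t3", {"piano", "oboe"}),
    ("t4", {"flute"}),
]
train, ood_test, dropped = instrument_disjoint_split(tracks, {"oboe", "bassoon"})
print(train, ood_test, dropped)  # ['t1', 't4'] ['t2'] ['t3']
```

Dropping mixed tracks is the conservative choice: it shrinks the dataset but guarantees that no held-out timbre leaks into training through a mixture.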

Circularity Check

0 steps flagged

No significant circularity; performance and generalization claims rest on empirical evaluation

full rationale

The paper introduces a two-branch architecture combining a timbre-agnostic backbone with a timbre encoder and note-level contrastive clustering. Its claims of competitive transcription accuracy, separation quality, and generalization to arbitrary instruments are presented as outcomes of model design choices and training on datasets, validated through experiments against baselines. No derivation chain reduces any result to fitted inputs by construction, no self-definitional equations appear, and no load-bearing self-citations or uniqueness theorems are invoked to force the architecture. The approach is self-contained against external benchmarks via reported metrics, with the supplied instrument count K treated as an inference-time input rather than a derived prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claims rest on standard assumptions of deep learning optimization and the effectiveness of contrastive loss for timbre separation; no explicit free parameters, axioms, or invented entities are detailed in the abstract beyond the user-specified instrument class count.

free parameters (1)
  • number of instrument classes
    User-provided hyperparameter that controls the clustering output and is required for dynamic separation.

pith-pipeline@v0.9.0 · 5652 in / 1068 out tokens · 34442 ms · 2026-05-18T16:42:40.366412+00:00 · methodology



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Benetos, E., Dixon, S., Duan, Z., and Ewert, S. (2019). Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1):20–30.

  2. [2]

    Bittner, R. M., Bosch, J. J., Rubinstein, D., Meseguer-Brocal, G., and Ewert, S. (2022). A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, pages 781–785. IEEE.

  3. [3]

    Cwitkowitz, F., Cheuk, K. W., Choi, W., Ramírez, M. A. M., Toyama, K., Liao, W., and Mitsufuji, Y. (2024). Timbre-trap: A low-resource framework for instrument-agnostic music transcription. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19, 2024, pages 1291–1295. IEEE.

  4. [4]

    Duan, Z., Pardo, B., and Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):2121–2133.

  5. [5]

    Gardner, J., Simon, I., Manilow, E., Hawthorne, C., and Engel, J. H. (2022). MT3: multi-task multitrack music transcription. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

  6. [6]

    Hershey, J. R., Chen, Z., Roux, J. L., and Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 31–35. IEEE.

  7. [7]

    Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558.

  8. [8]

    Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR.

  9. [9]

    Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2019). Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2):522–535.

  10. [10]

    Lin, L., Kong, Q., Jiang, J., and Xia, G. (2021). A unified model for zero-shot music source separation, transcription and synthesis. In Proceedings of 22nd International Conference on Music Information Retrieval, ISMIR.

  11. [11]

    Lin, T., Goyal, P., Girshick, R. B., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2999–3007. IEEE Computer Society.

  12. [12]

    Liu, S., Johns, E., and Davison, A. J. (2019). End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 1871–1880. Computer Vision Foundation / IEEE.

  13. [13]

    Luo, Y., Chen, Z., and Mesgarani, N. (2018). Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4):787–796.

  14. [14]

    Luo, Y. and Mesgarani, N. (2018). Tasnet: Time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, pages 696–700. IEEE.

  15. [15]

    Manilow, E., Seetharaman, P., and Pardo, B. (2020). Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pages 771–775. IEEE.

  16. [16]

    Manilow, E., Wichern, G., Seetharaman, P., and Roux, J. L. (2019). Cutting music source separation some slakh: A dataset to study the impact of training data quality and quantity

  17. [17]

    Miron, M., Carabias-Orti, J. J., Bosch, J., Gómez, E., and Janer, J. (2016). Phenicx-anechoic: note annotations for Aalto anechoic orchestral database.

  18. [18]

    Raffel, C., McFee, B., Humphrey, E. J., Salamon, J., and Ellis, D. P. W. (2014). mir_eval: A transparent implementation of common MIR metrics. In Proceedings - 15th International Society for Music Information Retrieval Conference (ISMIR 2014).

  19. [19]

    Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Adler, T., Kreil, D. P., Kopp, M. K., Klambauer, G., Brandstetter, J., and Hochreiter, S. (2021). Hopfield networks is all you need. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

  20. [20]

    Rouard, S., Massa, F., and Défossez, A. (2023). Hybrid transformers for music source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE.

  21. [21]

    Schörkhuber, C. and Klapuri, A. (2010). Constant-Q transform toolbox for music processing. Proc. 7th Sound and Music Computing Conf., pages 3–64.

  22. [22]

    Su, L. and Yang, Y.-H. (2015). Combining spectral and temporal representations for multipitch estimation of polyphonic music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(10):1600–1612.

  23. [23]

    Tamer, N. C., Özer, Y., Müller, M., and Serra, X. (2023). TAPE: an end-to-end timbre-aware pitch estimator. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE.

  24. [24]

    Tanaka, K., Nakatsuka, T., Nishikimi, R., Yoshii, K., and Morishima, S. (2020). Multi-instrument music transcription based on deep spherical clustering of spectrograms and pitchgrams. In ISMIR, pages 327–334.

  25. [25]

    Thickstun, J., Harchaoui, Z., and Kakade, S. M. (2017). Learning features of music from scratch. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

  26. [26]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R., editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Informatio...

  27. [27]

    Wu, H., Wu, J., Xu, J., Wang, J., and Long, M. (2022). Flowformer: Linearizing transformers with conservation flows. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S., editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learni...