arxiv: 2605.19752 · v1 · pith:XN7JKINTnew · submitted 2026-05-19 · 💻 cs.LG

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Pith reviewed 2026-05-20 07:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords: metabolite identification, mass spectrometry, molecule retrieval, contrastive learning, representation alignment, foundation models, multimodal learning, metabolomics

The pith

Aligning frozen foundation models for mass spectra and molecules via lightweight projections and contrastive learning improves retrieval of metabolite structures from spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a straightforward method for aligning pre-trained models of mass spectra and chemical molecules can lead to better performance in identifying metabolites from their mass spectrometry data. This matters because accurate metabolite identification is crucial for applications like drug discovery and environmental analysis. The approach uses simple multilayer perceptron projections on top of frozen foundation models, trained with a contrastive objective based on candidate molecules. It is presented as easier to implement and faster to train than existing methods while achieving higher accuracy across benchmarks. Additionally, the work examines how different data splitting strategies affect evaluation by measuring distribution shift.

Core claim

MSAlign learns a shared representation space by aligning two frozen foundation models through lightweight MLP projections trained with a candidate-based contrastive objective, leading to consistent outperformance over existing approaches in molecule retrieval from mass spectra.

What carries the argument

MSAlign, the method that aligns frozen foundation models for mass spectra and molecules using lightweight MLP projections and candidate-based contrastive training to create a shared representation space for improved retrieval.
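
To make the mechanism concrete, here is a minimal sketch in plain Python of the two trainable pieces named above: one MLP projection head per modality and a loss computed over a single spectrum's candidate set. Several things are assumptions rather than details taken from the paper: the one-hidden-layer ReLU architecture, cosine similarity with a temperature of 0.1, the InfoNCE-style softmax cross-entropy form, and every dimension. Random vectors stand in for the frozen DreaMS and ChemBERTa outputs.

```python
import math
import random


def mlp(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with ReLU: W2 * relu(W1 * x + b1) + b2."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]


def normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]


def candidate_contrastive_loss(spec_z, cand_z, true_idx, temperature=0.1):
    """Softmax cross-entropy over cosine similarities to the candidates.

    Assumed form: the true molecule is the positive, the other
    candidates of the same spectrum act as negatives.
    """
    s = normalize(spec_z)
    logits = [sum(a * b for a, b in zip(s, normalize(c))) / temperature for c in cand_z]
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[true_idx]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Stand-ins for frozen encoder outputs (hypothetical sizes, not the real models).
spec_emb = [rng.gauss(0.0, 1.0) for _ in range(6)]
cand_embs = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(4)]

# The only trainable parameters: one projection head per modality.
spec_head = (rand_matrix(rng, 8, 6), [0.0] * 8, rand_matrix(rng, 3, 8), [0.0] * 3)
mol_head = (rand_matrix(rng, 8, 5), [0.0] * 8, rand_matrix(rng, 3, 8), [0.0] * 3)

spec_z = mlp(spec_emb, *spec_head)
cand_z = [mlp(c, *mol_head) for c in cand_embs]
loss = candidate_contrastive_loss(spec_z, cand_z, true_idx=0)
print(f"loss for one spectrum and 4 candidates: {loss:.4f}")
```

In a setup like this only the two heads would receive gradient updates, so the encoder outputs could be computed once and cached; that is one plausible reading of why the method is described as fast to train.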

If this is right

  • MSAlign is simple to implement and fast to train compared to prior methods.
  • It consistently outperforms existing approaches across all benchmarks for molecule retrieval.
  • The candidate-based contrastive objective enables effective alignment without joint fine-tuning of the foundation models.
  • Quantifying distribution shift provides a way to evaluate and improve data splitting strategies in retrieval benchmarks.
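
The last point, the trade-off between leakage and shift, can be illustrated with a toy comparison of two splits. This page does not give the paper's shift measure, so the sketch below uses a plainly different stand-in: the mean Euclidean distance from each test embedding to its nearest training embedding. Leakage is taken as the fraction of test molecules whose identifier also appears in training. Function names, identifiers and coordinates are all hypothetical.

```python
import math


def leakage_rate(train_keys, test_keys):
    """Fraction of test molecules whose identifier also occurs in training."""
    train = set(train_keys)
    return sum(1 for k in test_keys if k in train) / len(test_keys)


def mean_nn_distance(train_embs, test_embs):
    """Stand-in shift proxy (not the paper's measure): mean distance from
    each test point to its nearest training point."""
    return sum(min(math.dist(t, r) for r in train_embs) for t in test_embs) / len(test_embs)


# Toy molecules: identifier -> 2-D embedding (made-up values).
mols = {"m1": (0.0, 0.0), "m2": (0.1, 0.0), "m3": (1.0, 1.0), "m4": (1.1, 1.0)}

# A random-style split: one test molecule is also in training.
loose_train, loose_test = ["m1", "m2", "m3"], ["m3", "m4"]
# A structure-aware split: test molecules are held out as a group.
strict_train, strict_test = ["m1", "m2"], ["m3", "m4"]

for name, tr, te in [("loose", loose_train, loose_test), ("strict", strict_train, strict_test)]:
    leak = leakage_rate(tr, te)
    shift = mean_nn_distance([mols[k] for k in tr], [mols[k] for k in te])
    print(f"{name}: leakage={leak:.2f}, shift proxy={shift:.2f}")
```

On this toy data the loose split shows leakage with a small shift proxy, while the strict split removes leakage at the cost of a larger one, which is the tension the paper sets out to quantify.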

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach indicates that lightweight alignment can suffice for multimodal tasks in chemistry instead of full model retraining.
  • The technique might extend to aligning other types of spectral and structural data in related scientific domains.
  • Releasing unified code and splits could standardize comparisons and reduce implementation barriers in metabolomics research.

Load-bearing premise

The frozen foundation models for mass spectra and molecules already encode sufficiently rich and compatible features that lightweight MLP projections plus candidate-based contrastive training can reliably improve retrieval without needing joint fine-tuning or suffering from distribution shift in real-world candidate sets.

What would settle it

Observing that MSAlign does not outperform baselines on a benchmark with substantial distribution shift between the candidate sets used in training and real-world conditions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.19752 by Camille Lançon, Charlotte Laclau, Etienne Thévenot, Florence d'Alché-Buc, Gabriel Melo, Paul Krzakala, Rémi Flamary.

Figure 1: Lightweight alignment of unimodal foundation models via candidate contrastive learning.
Figure 2: t-SNE visualization of the MassSpecGym MCES splits.
    Possible strategies. A variety of splitting strategies have been proposed in the literature, most aiming to ensure that test molecules differ sufficiently from those in the training set, with various definitions of “dissimilarity”. The MCES split of MassSpecGym [52] is based on the Maximum Common Edge Subgraph (MCES) similarity and enforces a min…
Figure 4: Similarities between pairs of candidates. In the ChemBERTa space, candidates are close to each other (average is µpc = 0.63), whereas MSAlign learns a space where they are easier to distinguish (µpc = −0.06).
    … proposed to align pretrained vision and language models [69], but it can be directly adapted to our setting for aligning molecules (ChemBERTa) and mass spectra (DreaMS); following the original work,…
Figure 5: Effect of scaling the effective batch size on Spectraverse performances. For MSAlign the…
Figure 6: t-SNE visualization of all MassSpecGym splits. We sample 2000 samples from the train…
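
The statistic quoted in the Figure 4 caption, µpc, is described as an average similarity between pairs of candidates. A minimal sketch of such a statistic follows, assuming cosine similarity averaged over all unordered pairs within one candidate set; the paper may define it differently, and the vectors are toy values.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(embs):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Candidates that crowd together: hard to tell apart by similarity alone.
clustered = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
# Candidates spread evenly around the circle: easier to separate.
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]

print(round(mean_pairwise_cosine(clustered), 3))
print(round(mean_pairwise_cosine(spread), 3))
```

A value near 1 means the candidates occupy almost the same direction in the embedding space, while a value near or below 0 means they are well separated, which matches the contrast the caption draws between the two spaces.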
Original abstract

Accurately identifying metabolites, i.e., small molecules, from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists of recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
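
The Molecule Retrieval task described in the abstract amounts to ranking each spectrum's candidates by a score and checking where the true molecule lands. The sketch below scores that with a top-k hit rate, which is a common retrieval metric; this page does not say which metrics the paper reports, so the metric, the pessimistic tie handling and the toy scores are all assumptions.

```python
def rank_of_true(scores, true_idx):
    """1-based rank of the true candidate; ties count against it."""
    s = scores[true_idx]
    return 1 + sum(1 for i, x in enumerate(scores) if i != true_idx and x >= s)


def hit_at_k(queries, k):
    """Fraction of queries whose true molecule ranks within the top k."""
    hits = sum(1 for scores, true_idx in queries if rank_of_true(scores, true_idx) <= k)
    return hits / len(queries)


# Each query: (similarity of the spectrum to every candidate, index of the true molecule).
queries = [
    ([0.9, 0.2, 0.1], 0),  # true candidate ranked 1st
    ([0.3, 0.8, 0.5], 2),  # true candidate ranked 2nd
    ([0.4, 0.4, 0.7], 0),  # tied with one candidate, below another: rank 3
]

for k in (1, 2, 3):
    print(f"hit@{k} = {hit_at_k(queries, k):.2f}")
```

Counting ties against the true candidate keeps the metric from rewarding a model that assigns identical scores to every candidate.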

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MSAlign, a lightweight method to align frozen DreaMS (mass spectra) and ChemBERTa (molecules) foundation models via MLP projections trained with a candidate-based contrastive objective for metabolite identification from MS/MS spectra. It also proposes a quantitative distribution-shift metric to analyze data-splitting strategies, critiques existing benchmarks for trading leakage against shift, and releases unified code, datasets, splits, and baseline implementations.

Significance. If the central results hold, MSAlign shows that simple alignment of independently pre-trained models can deliver consistent retrieval gains without joint fine-tuning, providing an efficient and reproducible approach. The distribution-shift metric addresses a persistent evaluation issue in molecule retrieval. The public release of all datasets, splits, candidate sets, and a unified implementation framework is a clear strength that supports reproducibility and future work.

major comments (2)
  1. [§4] §4 (experimental results): the central claim of consistent outperformance across all benchmarks is presented without error bars, standard deviations, or statistical significance tests over multiple random seeds or runs; this makes it difficult to assess whether the reported gains over baselines are robust.
  2. [§5] §5 (distribution shift analysis): the paper introduces a quantitative shift measure and explicitly notes that existing splits trade leakage against domain shift, yet the main benchmark results rely on the very splits critiqued in this section; this creates a tension with the claim that gains will hold for realistic candidate sets (e.g., PubChem or HMDB) that may exhibit larger shifts.
minor comments (2)
  1. [Figure 1] The architecture diagram (Figure 1 or 2) would benefit from explicit notation of the projection dimensions and the exact form of the contrastive loss to improve clarity for readers implementing the method.
  2. [Table 2] Table 2 (or equivalent results table) lists baseline comparisons but does not indicate which components of the unified framework were used for each baseline; adding a short column or footnote would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and describe the revisions we will make to improve the robustness of the reported results and to clarify the relationship between our benchmark evaluations and the distribution-shift analysis.

  1. Referee: [§4] §4 (experimental results): the central claim of consistent outperformance across all benchmarks is presented without error bars, standard deviations, or statistical significance tests over multiple random seeds or runs; this makes it difficult to assess whether the reported gains over baselines are robust.

    Authors: We agree that reporting variability across runs would strengthen the central claims. In the revised manuscript we will rerun all experiments with at least five random seeds, report mean performance together with standard deviations for every metric and baseline, and add paired statistical significance tests (e.g., Wilcoxon signed-rank) between MSAlign and the strongest baselines on each benchmark. revision: yes

  2. Referee: [§5] §5 (distribution shift analysis): the paper introduces a quantitative shift measure and explicitly notes that existing splits trade leakage against domain shift, yet the main benchmark results rely on the very splits critiqued in this section; this creates a tension with the claim that gains will hold for realistic candidate sets (e.g., PubChem or HMDB) that may exhibit larger shifts.

    Authors: We acknowledge the tension. The main tables use the canonical MassSpecGym and Spectraverse splits solely to enable head-to-head comparison with all previously published numbers; this is the conventional practice when introducing a new method. Section 5 then quantifies the leakage-versus-shift trade-off for these and alternative splits using the new metric we introduce. In the revision we will add an explicit paragraph in the discussion that (i) states that the scope of the current claims is the standard benchmarks and (ii) notes that the gains may be larger or smaller under higher-shift regimes. We will also include a short additional experiment that evaluates MSAlign on one higher-shift split constructed according to the metric, thereby directly addressing the referee’s concern about realistic candidate sets. revision: partial
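
The reporting promised in the first response (mean, standard deviation, and a paired Wilcoxon signed-rank test) can be sketched with the standard library alone. The scores below are made-up numbers for illustration only, not results from the paper, and `wilcoxon_exact` is a hypothetical helper that assumes no zero or tied absolute differences. It also assumes the pairs are the seeds themselves, which the response does not specify.

```python
from itertools import product
from statistics import mean, stdev


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples.

    Assumes no zero differences and no ties among |differences|.
    """
    d = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    total = sum(ranks)
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=len(d)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** len(d)


# Made-up per-seed scores for two methods (illustrative only).
scores_a = [0.412, 0.405, 0.420, 0.409, 0.415]
scores_b = [0.391, 0.398, 0.395, 0.389, 0.401]

print(f"A: {mean(scores_a):.3f} +/- {stdev(scores_a):.3f}")
print(f"B: {mean(scores_b):.3f} +/- {stdev(scores_b):.3f}")
print(f"exact two-sided p = {wilcoxon_exact(scores_a, scores_b)}")
```

One consequence is worth noting: if the pairs are the five seeds, the smallest two-sided exact p-value this test can produce is 2/2^5 = 0.0625, even when one method wins on every seed. Reaching the usual 0.05 threshold would need at least six pairs, or pairing over test queries rather than seeds.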

Circularity Check

0 steps flagged

No significant circularity; method and claims are empirically grounded in independent pre-trained models and external benchmarks

The paper's core contribution is an empirical alignment method (MSAlign) that freezes independently pre-trained DreaMS and ChemBERTa models, adds lightweight MLP projections, and trains them with a candidate-based contrastive loss on retrieval tasks. This construction does not reduce any claimed performance gain to a fitted parameter defined by the evaluation data itself, nor does it rely on self-citation for load-bearing uniqueness theorems or ansatzes. The newly introduced distribution-shift metric is used to analyze existing splits rather than to derive the method's superiority. All reported outperformance is validated on public benchmarks (MassSpecGym, Spectraverse) with released code and splits, keeping the derivation self-contained against external data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the two cited foundation models and the suitability of the contrastive objective on candidate sets; no new physical entities or ad-hoc constants are introduced beyond standard training hyperparameters.

axioms (1)
  • domain assumption: DreaMS and ChemBERTa produce representations that are alignable by simple MLPs for the retrieval task.
    The method freezes these models without further training, assuming their pre-trained features are already sufficient.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

  1. [1]

    Ahmad, W., Simon, E., Chithrananda, S., Grand, G., and Ramsundar, B. (2022). ChemBERTa-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712

  2. [2]

    Alseekh, S., Aharoni, A., Brotman, Y., Contrepois, K., D’Auria, J., Ewald, J., Ewald, J. C., Fraser, P. D., Giavalisco, P., Hall, R. D., Heinemann, M., Link, H., Luo, J., Neumann, S., Nielsen, J., Perez de Souza, L., Saito, K., Sauer, U., Schroeder, F. C., Schuster, S., Siuzdak, G., Skirycz, A., Sumner, L. W., Snyder, M. P., Tang, H., Tohge, T., Wang, Y ...

  3. [3]

    Bakır, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and Vishwanathan, S. V. N., editors (2007). Predicting Structured Data. MIT Press, Cambridge, MA

  4. [4]

    Bittremieux, W., Wang, M., and Dorrestein, P. C. (2022). The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics, 18(12):94

  5. [5]

    Bohde, M., Manjrekar, M., Wang, R., Ji, S., and Coley, C. W. (2025). DiffMS: diffusion generation of molecules conditioned on mass spectra. In Proceedings of the 42nd International Conference on Machine Learning, ICML’25. JMLR.org

  6. [6]

    Brogat-Motte, L., Flamary, R., Brouard, C., Rousu, J., and d’Alché Buc, F. (2022). Learning to predict graphs with fused gromov-wasserstein barycenters. In International Conference on Machine Learning, pages 2321–2335. PMLR

  7. [7]

    Brouard, C., Shen, H., Dührkop, K., d’Alché Buc, F., Böcker, S., and Rousu, J. (2016). Fast metabolite identification with input output kernel regression. Bioinformatics, 32(12):i28–i36

  8. [8]

    Bushuiev, R., Bushuiev, A., de Jonge, N. F., Young, A., Kretschmer, F., Samusevich, R., Heirman, J., Wang, F., Zhang, L., Dührkop, K., et al. (2024). MassSpecGym: A benchmark for the discovery and identification of molecules. Advances in Neural Information Processing Systems, 37:110010–110027

  9. [9]

    Bushuiev, R., Bushuiev, A., Samusevich, R., Brungs, C., Sivic, J., and Pluskal, T. (2025). Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nature Biotechnology, pages 1–11

  10. [10]

    Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR

  11. [11]

    Chen, Y. Z., Rushing, B., and Hassoun, S. (2026). FLARE: Fine-grained learning for alignment of spectra-molecule representation enhances metabolite annotation. bioRxiv, pages 2026–01

  12. [12]

    Damont, A., Darii, E., Cao, C., Legrand, A., Perret, A., Dechaumet, S., Woods, A. S., Junot, C., Tabet, J.-C., and Fenaille, F. (2025). Exploring the fragmentation of sodiated species involving covalent-bond cleavages for metabolite characterization. Rapid Communications in Mass Spectrometry, page e10133

  13. [13]

    de Jonge, N., van der Hooft, J. J. J., and Probst, D. (2025). To Bin or not to Bin: Alternative Representations of Mass Spectra.

  14. [14]

    De Vijlder, T., Valkenborg, D., Lemière, F., Romijn, E. P., Laukens, K., and Cuyckens, F. (2018). A tutorial in small molecule identification via electrospray ionization-mass spectrometry: The practical art of structural elucidation. Mass Spectrometry Reviews, 37(5):607–629

  15. [15]

    De Waele, G., Wydmuch, M., Waegeman, W., et al. (2026). Small molecule retrieval from tandem mass spectrometry: what are we optimizing for? arXiv preprint arXiv:2602.16507

  16. [16]

    Domingo-Almenara, X., Montenegro-Burke, J. R., Benton, H. P., and Siuzdak, G. (2018). Annotation: A Computational Solution for Streamlining Metabolomics Analysis. Analytical Chemistry, 90(1):480–489

  17. [17]

    Dührkop, K., Fleischauer, M., Ludwig, M., Aksenov, A. A., Melnik, A. V., Meusel, M., Dorrestein, P. C., Rousu, J., and Böcker, S. (2019). SIRIUS 4: A rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods, 16(4):299–302

  18. [18]

    Dührkop, K., Nothias, L.-F., Fleischauer, M., Reher, R., Ludwig, M., Hoffmann, M. A., Petras, D., Gerwick, W. H., Rousu, J., Dorrestein, P. C., et al. (2021). Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nature biotechnology, 39(4):462–471

  19. [19]

    Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112(41):12580–12585

  20. [20]

    El Abiead, Y., Rutz, A., Zuffa, S., Amer, B., Xing, S., Brungs, C., Schmid, R., Correia, M. S. P., Caraballo-Rodriguez, A. M., Zarrinpar, A., Mannochio-Russo, H., Witting, M., Mohanty, I., Pluskal, T., Bittremieux, W., Knight, R., Patterson, A. D., van der Hooft, J. J. J., Böcker, S., Dunn, W. B., Linington, R. G., Wishart, D. S., Wolfender, J.-L., Fieh...

  21. [21]

    El Ahmad, T., Brogat-Motte, L., Laforgue, P., and d’Alché Buc, F. (2024). Sketch in, sketch out: Accelerating both learning and inference for structured prediction with kernels. In International conference on artificial intelligence and statistics, pages 109–117. PMLR

  22. [22]

    Elser, D., Huber, F., and Gaquerel, E. (2023). Mass2SMILES: deep learning based fast prediction of structures and functional groups directly from high-resolution ms/ms spectra. bioRxiv, pages 2023–07

  23. [23]

    Fan, Z., Alley, A., Ghaffari, K., and Ressom, H. W. (2020). MetFID: Artificial neural network-based compound fingerprint prediction for metabolite annotation. Metabolomics, 16(10):104

  24. [24]

    Farahani, A., Voghoei, S., Rasheed, K., and Arabnia, H. R. (2021). A brief review of domain adaptation. Advances in data science and information engineering: proceedings from ICDATA2020 and IKE 2020, pages 877–894

  25. [25]

    Flamary, R., Vincent-Cuaz, C., Courty, N., Gramfort, A., Kachaiev, O., Tran, H. Q., David, L., Bonet, C., Cassereau, N., Gnassounou, T., et al. (2024). POT Python Optimal Transport (version 0.9.5), 2024. URL https://github.com/PythonOT/POT, 10

  26. [26]

    Goldman, S., Wohlwend, J., Stražar, M., Haroush, G., Xavier, R. J., and Coley, C. W. (2023a). Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nature Machine Intelligence, 5(9):965–979

  27. [27]

    Goldman, S., Xin, J., Provenzano, J., and Coley, C. W. (2023b). MIST-CF: Chemical formula inference from tandem mass spectra. Journal of Chemical Information and Modeling, 64(7):2421–2431

  28. [28]

    Gupta, V., Qiang, H., Chung, H.-H., Herbst, E., and Skinnider, M. A. (2026). Comprehensive curation and harmonization of small-molecule MS/MS libraries in Spectraverse. Analytical Chemistry, 98(5):3934–3943

  29. [29]

    Han, Y., Wang, P., Yu, K., Chen, X., and Chen, L. (2025). MS-BART: Unified modeling of mass spectra and molecules for structure elucidation. arXiv preprint arXiv:2510.20615

  30. [30]

    Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18):2333–2341

  31. [31]

    Heirman, J. and Bittremieux, W. (2024). Reusability report: annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nature Machine Intelligence, 6(11):1296–1302

  32. [32]

    Hong, Y., Li, S., Ye, Y., and Tang, H. (2025). FIDDLE: a deep learning method for chemical formulas prediction from tandem mass spectra. Nature Communications, 16(1):11102

  33. [33]

    Huh, M., Cheung, B., Wang, T., and Isola, P. (2024). Position: The platonic representation hypothesis. In ICML, pages 20617–20642

  34. [34]

    Ji, H., Deng, H., Lu, H., and Zhang, Z. (2020). Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural networks. Analytical chemistry, 92(13):8649–8653

  35. [35]

    Ji, X., Wang, Z., Gao, Z., Zheng, H., Zhang, L., Ke, G., et al. (2024). Uni-Mol2: Exploring molecular pretraining model at scale. arXiv preprint arXiv:2406.14969.

  36. [36]

    Kalia, A., Zhou Chen, Y., Krishnan, D., and Hassoun, S. (2025). JESTR: Joint embedding space technique for ranking candidate molecules for the annotation of untargeted metabolomics data. Bioinformatics, 41(7):btaf354

  37. [37]

    Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., et al. (2023). PubChem 2023 update. Nucleic acids research, 51(D1):D1373–D1380

  38. [38]

    Kind, T., Tsugawa, H., Cajka, T., Ma, Y., Lai, Z., Mehta, S. S., Wohlgemuth, G., Barupal, D. K., Showalter, M. R., Arita, M., and Fiehn, O. (2018). Identification of small molecules using accurate mass MS/MS search. Mass Spectrometry Reviews, 37(4):513–532

  39. [39]

    Krzakala, P., Melo, G., Laclau, C., d’Alché Buc, F., and Flamary, R. (2025). The quest for the graph level autoencoder (GRALE). arXiv preprint arXiv:2505.22109

  40. [40]

    Kudriavtseva, P., Kashkinov, M., and Kertész-Farkas, A. (2021). Deep convolutional neural networks help scoring tandem mass spectrometry data in database-searching approaches. Journal of proteome research, 20(10):4708–4717

  41. [41]

    Landrum, G. et al. (2013). Rdkit documentation. Release, 1(1-79):4

  42. [42]

    LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F., et al. (2006). A tutorial on energy-based learning. Predicting structured data, 1(0)

  43. [43]

    Litsa, E., Chenthamarakshan, V., Das, P., and Kavraki, L. (2021). Spec2Mol: An end-to-end deep learning framework for translating ms/ms spectra to de-novo molecules. ChemRxiv

  44. [44]

    Ludwig, M., Broeckling, C. D., Dorrestein, P. C., Dührkop, K., Schymanski, E. L., Böcker, S., and Nothias, L.-F. (2020). Studying charge migration fragmentation of sodiated precursor ions in collision-induced dissociation at the library scale. Journal of the American Society for Mass Spectrometry, 32(1):180–186

  45. [45]

    Méndez-Lucio, O., Nicolaou, C., and Earnshaw, B. (2022). MolE: a molecular foundation model for drug discovery. arXiv preprint arXiv:2211.02657

  46. [46]

    Nguyen, D. H., Nguyen, C. H., and Mamitsuka, H. (2019). Recent advances and prospects of computational methods for metabolite identification: A review with emphasis on machine learning approaches. Briefings in Bioinformatics, 20(6):2028–2043

  47. [47]

    Nowozin, S. and Lampert, C. H. (2011). Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):3–4

  48. [48]

    Peyré, G. and Cuturi, M. (2019). Computational optimal transport with applications to data sciences. Foundations and Trends® in Machine Learning, 11(5-6):355–607

  49. [49]

    Pollmann, J., Bushuiev, R., Bushuiev, A., Pluskal, T., and Huber, F. (2026). Bridging ms2 spectra and chemical space: Advances in spectral similarity, molecular retrieval, and de novo structure discovery. chemrxiv.15000536

  50. [50]

    Quinn, R. A., Melnik, A. V., Vrbanac, A., Fu, T., Patras, K. A., Christy, M. P., Bodai, Z., Belda-Ferre, P., Tripathi, A., Chung, L. K., Downes, M., Welch, R. D., Quinn, M., Humphrey, G., Panitchpakdi, M., Weldon, K. C., Aksenov, A., da Silva, R., Avila-Pacheco, J., Clish, C., Bae, S., Mallick, H., Franzosa, E. A., Lloyd-Price, J., Bussell, R., Thron, T....

  51. [51]

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763

  52. [52]

    Rakhshaninejad, M., De Waele, G., Jürgens, M., and Waegeman, W. (2026). Reliable molecular retrieval from mass spectra using conformal prediction. bioRxiv, pages 2026–03

  53. [53]

    Robinson, J., Chuang, C.-Y., Sra, S., and Jegelka, S. (2020). Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592

  54. [54]

    Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., and Huang, J. (2020). Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559–12571

  55. [55]

    Roschmann, S., Krzakala, P., Mazelet, S., Bouniot, Q., and Akata, Z. (2026). SOTAlign: Semi-supervised alignment of unimodal vision and language models via optimal transport. arXiv preprint arXiv:2602.23353

  56. [56]

    Russo, F. F., Nowatzky, Y., Jaeger, C., Parr, M. K., Benner, P., Muth, T., and Lisec, J. (2024). Machine learning methods for compound annotation in non-targeted mass spectrometry: A brief overview of fingerprinting, in silico fragmentation and de novo methods. Rapid Communications in Mass Spectrometry, 38(20):e9876.

  57. [57]

    Schymanski, E. L., Jeon, J., Gulde, R., Fenner, K., Ruff, M., Singer, H. P., and Hollender, J. (2014). Identifying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence. Environmental Science & Technology, 48(4):2097–2098

  58. [58]

    Stravs, M. A., Dührkop, K., Böcker, S., and Zamboni, N. (2022). MSNovelist: de novo structure generation from mass spectra. Nature Methods, 19(7):865–870

  59. [59]

    Thirukovalluru, R., Meng, R., Liu, Y., Su, M., Nie, P., Yavuz, S., Zhou, Y., Chen, W., Dhingra, B., et al. (2025). Breaking the batch barrier (B3) of contrastive learning via smart batch mining. arXiv preprint arXiv:2505.11293

  60. [60]

    Vaniya, A. and Fiehn, O. (2022). Revisiting CASMI: Compound ID for 500 new unknowns, using LC-MS/MS data

  61. [61]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30

  62. [62]

    Vouitsis, N., Liu, Z., Gorti, S. K., Villecroze, V., Cresswell, J. C., Yu, G., Loaiza-Ganem, G., and Volkovs, M. (2024). Data-efficient multimodal fusion on a single GPU. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27239–27251

  63. [63]

    Wang, T. and Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR

  64. [64]

    Wang, Y., Chen, X., Liu, L., and Hassoun, S. (2025). MADGEN: Mass-spec attends to de novo molecular generation. arXiv preprint arXiv:2501.01950

  65. [65]

    Wishart, D. S. (2019). Metabolomics for investigating physiological and pathophysiological processes. Physiological Reviews, 99(4):1819–1875

  66. [66]

    Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530

  67. [67]

    Xing, S., Shen, S., Xu, B., Li, X., and Huan, T. (2023). BUDDY: Molecular formula discovery via bottom-up MS/MS interrogation. Nature Methods, 20(6):881–890

  68. [68]

    Xu, R. and Zhu, J. (2025). Unveiling the dark matter of the metabolome: A narrative review of bioinformatics tools for LC-HRMS-based compound annotation. Talanta, 295:128327

  69. [69]

    Zhang, L., Yang, Q., and Agrawal, A. (2025). Assessing and learning alignment of unimodal vision and language models. In CVPR, pages 14604–14614.
