MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Camille Lançon; Charlotte Laclau; Etienne Thévenot; Florence d'Alché-Buc; Gabriel Melo; Paul Krzakala; Rémi Flamary

arxiv: 2605.19752 · v1 · pith:XN7JKINTnew · submitted 2026-05-19 · 💻 cs.LG

Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau, Rémi Flamary, Etienne Thévenot, Florence d'Alché-Buc

Pith reviewed 2026-05-20 07:43 UTC · model grok-4.3

classification 💻 cs.LG

keywords metabolite identification · mass spectrometry · molecule retrieval · contrastive learning · representation alignment · foundation models · multimodal learning · metabolomics

The pith

Aligning frozen foundation models for mass spectra and molecules via lightweight projections and contrastive learning improves retrieval of metabolite structures from spectra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a straightforward method for aligning pre-trained models of mass spectra and chemical molecules can lead to better performance in identifying metabolites from their mass spectrometry data. This matters because accurate metabolite identification is crucial for applications like drug discovery and environmental analysis. The approach uses simple multilayer perceptron projections on top of frozen foundation models, trained with a contrastive objective based on candidate molecules. It is presented as easier to implement and faster to train than existing methods while achieving higher accuracy across benchmarks. Additionally, the work examines how different data splitting strategies affect evaluation by measuring distribution shift.

Core claim

MSAlign learns a shared representation space by aligning two frozen foundation models through lightweight MLP projections trained with a candidate-based contrastive objective, leading to consistent outperformance over existing approaches in molecule retrieval from mass spectra.

What carries the argument

MSAlign, the method that aligns frozen foundation models for mass spectra and molecules using lightweight MLP projections and candidate-based contrastive training to create a shared representation space for improved retrieval.
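
To make the mechanism concrete, here is a minimal sketch in plain Python of a projection head and a candidate-based contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the one-hidden-layer ReLU MLP, the dot-product logits on normalised vectors, the `temperature` value, the toy dimensions, and every function name (`linear`, `mlp_project`, `candidate_contrastive_loss`, `init_params`) are hypothetical, and random vectors stand in for the frozen DreaMS and ChemBERTa outputs.

```python
import math
import random

def linear(x, W, b):
    """Affine map; W is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def mlp_project(x, params):
    """One-hidden-layer ReLU MLP followed by L2 normalisation (assumed design)."""
    W1, b1, W2, b2 = params
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    z = linear(hidden, W2, b2)
    norm = math.sqrt(sum(v * v for v in z)) or 1.0  # guard against a zero vector
    return [v / norm for v in z]

def candidate_contrastive_loss(spec_z, cand_z, true_idx, temperature=0.1):
    """Softmax cross-entropy over one spectrum's candidate set (InfoNCE-style)."""
    logits = [sum(s * c for s, c in zip(spec_z, z)) / temperature for z in cand_z]
    top = max(logits)  # subtract the max for numerical stability
    log_partition = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_partition - logits[true_idx]

def init_params(d_in, d_hidden, d_out, rng):
    """Small random weights and zero biases for one projection head."""
    W1 = [[rng.gauss(0.0, 0.5) for _ in range(d_in)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0.0, 0.5) for _ in range(d_hidden)] for _ in range(d_out)]
    return W1, [0.0] * d_hidden, W2, [0.0] * d_out

rng = random.Random(0)
spec_head = init_params(d_in=6, d_hidden=8, d_out=4, rng=rng)  # sits on the spectrum encoder
mol_head = init_params(d_in=5, d_hidden=8, d_out=4, rng=rng)   # sits on the molecule encoder

# Stand-ins for frozen encoder outputs: one spectrum and its three candidate molecules.
spectrum_feat = [rng.gauss(0.0, 1.0) for _ in range(6)]
candidate_feats = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(3)]

spec_z = mlp_project(spectrum_feat, spec_head)
cand_z = [mlp_project(f, mol_head) for f in candidate_feats]
loss = candidate_contrastive_loss(spec_z, cand_z, true_idx=0)
print(f"loss for this spectrum: {loss:.4f}")
```

In a setup like this only the two projection heads would be trained, while the encoders stay frozen. A practical version would use a tensor library with automatic differentiation and batch many spectra together.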

If this is right

MSAlign is simple to implement and fast to train compared to prior methods.
It consistently outperforms existing approaches across all benchmarks for molecule retrieval.
The candidate-based contrastive objective enables effective alignment without joint fine-tuning of the foundation models.
Quantifying distribution shift provides a way to evaluate and improve data splitting strategies in retrieval benchmarks.
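
This summary does not say how the paper defines its shift measure. As a stand-in only, the sketch below scores the gap between a training set and a test set of embeddings with the energy distance, a standard two-sample statistic. The choice of statistic, the function names `mean_distance` and `energy_distance`, and the toy 2-D points are all assumptions made for illustration.

```python
import math

def mean_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    return sum(math.dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(train, test):
    """Energy distance between two point sets; it is 0 when the sets coincide."""
    return (2.0 * mean_distance(train, test)
            - mean_distance(train, train)
            - mean_distance(test, test))

# Toy embeddings (hypothetical): a test set that overlaps the training set,
# and one that has been moved away from it.
train = [[0.0, 0.0], [1.0, 0.0]]
test_same = [[0.0, 0.0], [1.0, 0.0]]
test_shifted = [[5.0, 0.0], [6.0, 0.0]]

print(energy_distance(train, test_same))     # no shift
print(energy_distance(train, test_shifted))  # larger value means larger shift
```

A score of this kind lets two candidate splits be compared on the same scale: a split with a near-zero score risks leakage, while a very large score signals a strong domain shift.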

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach indicates that lightweight alignment can suffice for multimodal tasks in chemistry instead of full model retraining.
The technique might extend to aligning other types of spectral and structural data in related scientific domains.
Releasing unified code and splits could standardize comparisons and reduce implementation barriers in metabolomics research.

Load-bearing premise

The frozen foundation models for mass spectra and molecules already encode sufficiently rich and compatible features that lightweight MLP projections plus candidate-based contrastive training can reliably improve retrieval without needing joint fine-tuning or suffering from distribution shift in real-world candidate sets.

What would settle it

Observing that MSAlign does not outperform baselines on a benchmark with substantial distribution shift between the candidate sets used in training and real-world conditions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.19752 by Camille Lançon, Charlotte Laclau, Etienne Thévenot, Florence d'Alché-Buc, Gabriel Melo, Paul Krzakala, Rémi Flamary.

**Figure 2.** t-SNE visualization of the MassSpecGym MCES splits. Possible strategies. A variety of splitting strategies have been proposed in the literature, most aiming to ensure that test molecules differ sufficiently from those in the training set, with various definitions of “dissimilarity”. The MCES split of MassSpecGym [52] is based on the Maximum Common Edge Subgraph (MCES) similarity and enforces a min…

**Figure 4.** Similarities between pairs of candidates. In the ChemBERTa space, candidates are close to each other (average is µpc = 0.63), whereas MSAlign learns a space where they are easier to distinguish (µpc = −0.06). … proposed to align pretrained vision and language models [69], but it can be directly adapted to our setting for aligning molecules (ChemBERTa) and mass spectra (DreaMS); following the original work,…

**Figure 5.** Effect of scaling the effective batch size on Spectraverse performance. For MSAlign the…

**Figure 6.** t-SNE visualization of all MassSpecGym splits. We sample 2000 samples from the train…
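
The Figure 4 caption reports an average similarity between pairs of candidates (µpc). A quantity of that kind can be sketched as the mean similarity over all unordered pairs in one candidate set. Treating the similarity as cosine similarity is an assumption here, as are the function names and the toy vectors; the caption does not spell out the exact definition.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pairwise_similarity(cand_embs):
    """Average cosine similarity over all unordered pairs of candidates."""
    pairs = list(combinations(cand_embs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical candidate sets: one where embeddings crowd together,
# one where they point in well-separated directions.
crowded = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(f"crowded: {mean_pairwise_similarity(crowded):.3f}")
print(f"spread:  {mean_pairwise_similarity(spread):.3f}")
```

A high average means the candidates are hard to tell apart in that space, which is the situation the caption describes for raw ChemBERTa embeddings; a value near or below zero means they are well separated.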

Original abstract

Accurately identifying metabolites i.e. small molecules from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists in recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
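
The Molecule Retrieval task described in the abstract can be sketched as ranking each spectrum's candidate molecules by similarity in the shared space and checking whether the true molecule lands in the top k. The snippet below is a minimal illustration, assuming cosine similarity as the ranking score; the names `rank_candidates` and `hit_at_k` and the toy 2-D embeddings are hypothetical and do not come from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(spec_emb, cand_embs):
    """Candidate indices sorted from most to least similar to the spectrum."""
    scores = [cosine(spec_emb, c) for c in cand_embs]
    return sorted(range(len(cand_embs)), key=lambda i: scores[i], reverse=True)

def hit_at_k(queries, k):
    """Fraction of queries whose true molecule is ranked within the top k."""
    hits = sum(
        1 for spec, cands, true_idx in queries
        if true_idx in rank_candidates(spec, cands)[:k]
    )
    return hits / len(queries)

# Each query: (spectrum embedding, candidate embeddings, index of the true molecule).
queries = [
    ([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]], 1),  # true molecule ranked first
    ([0.0, 1.0], [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], 1),  # true molecule ranked second
]

print(hit_at_k(queries, k=1))
print(hit_at_k(queries, k=2))
```

Because each spectrum comes with its own candidate set, the score is computed per query and then averaged, rather than ranking against one global pool of molecules.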

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MSAlign gives a straightforward alignment of frozen DreaMS and ChemBERTa models via candidate contrastive loss, plus a public unified codebase and a new quantitative measure of distribution shift in splits.

The main thing to know is that this paper delivers a practical, easy-to-run method for aligning pre-trained mass spectra and molecule embeddings to improve metabolite retrieval, backed by a full public release of code, data, and splits. It also formalizes the leakage-versus-shift problem in evaluation with a new metric. That combination makes the work more useful than the core idea alone would suggest.

The method freezes DreaMS for spectra and ChemBERTa for molecules, adds small MLP projections, and trains them with a contrastive loss that uses candidate sets. Training stays fast and the approach stays simple. They report consistent gains over baselines on the available benchmarks, and the unified framework they built should make it easier for others to reproduce or extend the comparisons. Releasing everything publicly is the part that stands out most.

The shift metric is a reasonable addition too. It turns a long-standing complaint about splitting strategies into something measurable, which could help people pick better train-test divisions in future work.

The soft spots are mostly around generalization. The gains rest on the assumption that the frozen foundation models already encode enough compatible structure-spectrum information that lightweight alignment can close the remaining gap. If real candidate sets drawn from PubChem or HMDB create larger shifts than the benchmark splits, the frozen-model shortcut may not hold and joint fine-tuning could become necessary. The paper itself flags the split issue, so it would help to see how their own splits score on the new metric and whether they include harder shift cases. Without visible error bars or detailed ablations in the summary, the robustness of the reported wins is still hard to judge fully.

This is aimed at metabolomics researchers who build or use ML tools for molecule identification from spectra. Anyone working on multimodal alignment or evaluation practices in chemistry-adjacent ML will find the code and the shift analysis worth looking at.

It deserves a serious referee. The reproducibility push and the evaluation contribution give it enough substance to justify review time, even if the alignment technique itself extends existing contrastive methods rather than inventing a new paradigm.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MSAlign, a lightweight method to align frozen DreaMS (mass spectra) and ChemBERTa (molecules) foundation models via MLP projections trained with a candidate-based contrastive objective for metabolite identification from MS/MS spectra. It also proposes a quantitative distribution-shift metric to analyze data-splitting strategies, critiques existing benchmarks for trading leakage against shift, and releases unified code, datasets, splits, and baseline implementations.

Significance. If the central results hold, MSAlign shows that simple alignment of independently pre-trained models can deliver consistent retrieval gains without joint fine-tuning, providing an efficient and reproducible approach. The distribution-shift metric addresses a persistent evaluation issue in molecule retrieval. The public release of all datasets, splits, candidate sets, and a unified implementation framework is a clear strength that supports reproducibility and future work.

major comments (2)

[§4] §4 (experimental results): the central claim of consistent outperformance across all benchmarks is presented without error bars, standard deviations, or statistical significance tests over multiple random seeds or runs; this makes it difficult to assess whether the reported gains over baselines are robust.
[§5] §5 (distribution shift analysis): the paper introduces a quantitative shift measure and explicitly notes that existing splits trade leakage against domain shift, yet the main benchmark results rely on the very splits critiqued in this section; this creates a tension with the claim that gains will hold for realistic candidate sets (e.g., PubChem or HMDB) that may exhibit larger shifts.

minor comments (2)

[Figure 1] The architecture diagram (Figure 1 or 2) would benefit from explicit notation of the projection dimensions and the exact form of the contrastive loss to improve clarity for readers implementing the method.
[Table 2] Table 2 (or equivalent results table) lists baseline comparisons but does not indicate which components of the unified framework were used for each baseline; adding a short column or footnote would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment point by point below and describe the revisions we will make to improve the robustness of the reported results and to clarify the relationship between our benchmark evaluations and the distribution-shift analysis.

Referee: [§4] §4 (experimental results): the central claim of consistent outperformance across all benchmarks is presented without error bars, standard deviations, or statistical significance tests over multiple random seeds or runs; this makes it difficult to assess whether the reported gains over baselines are robust.

Authors: We agree that reporting variability across runs would strengthen the central claims. In the revised manuscript we will rerun all experiments with at least five random seeds, report mean performance together with standard deviations for every metric and baseline, and add paired statistical significance tests (e.g., Wilcoxon signed-rank) between MSAlign and the strongest baselines on each benchmark. revision: yes

Referee: [§5] §5 (distribution shift analysis): the paper introduces a quantitative shift measure and explicitly notes that existing splits trade leakage against domain shift, yet the main benchmark results rely on the very splits critiqued in this section; this creates a tension with the claim that gains will hold for realistic candidate sets (e.g., PubChem or HMDB) that may exhibit larger shifts.

Authors: We acknowledge the tension. The main tables use the canonical MassSpecGym and Spectraverse splits solely to enable head-to-head comparison with all previously published numbers; this is the conventional practice when introducing a new method. Section 5 then quantifies the leakage-versus-shift trade-off for these and alternative splits using the new metric we introduce. In the revision we will add an explicit paragraph in the discussion that (i) states that the scope of the current claims is the standard benchmarks and (ii) notes that gains may be larger or smaller under higher-shift regimes. We will also include a short additional experiment that evaluates MSAlign on one higher-shift split constructed according to the metric, thereby directly addressing the referee’s concern about realistic candidate sets. revision: partial
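
The reporting promised in the first response (mean and standard deviation over seeds, plus a paired Wilcoxon signed-rank test) can be sketched with the standard library alone. The code below is a minimal illustration: the function name `wilcoxon_signed_rank_exact` is hypothetical, the exact test assumes no zero differences and no tied absolute differences, and the per-seed scores are made-up placeholders, not results from the paper.

```python
import itertools
import statistics

def wilcoxon_signed_rank_exact(a, b):
    """Return (W+, exact two-sided p-value) for paired samples a and b.

    Assumes no zero differences and no ties among the absolute differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in itertools.product([0, 1], repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** len(diffs)

# Placeholder per-seed scores for two methods (five seeds each, purely illustrative).
scores_a = [0.31, 0.335, 0.32, 0.34, 0.30]
scores_b = [0.28, 0.29, 0.27, 0.30, 0.285]

print(f"A: {statistics.mean(scores_a):.3f} ± {statistics.stdev(scores_a):.3f}")
print(f"B: {statistics.mean(scores_b):.3f} ± {statistics.stdev(scores_b):.3f}")
print(wilcoxon_signed_rank_exact(scores_a, scores_b))
```

One design point follows directly from the enumeration: with five paired seeds and no ties, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, so a 0.05 threshold would need more seeds or pairing across more units, such as benchmarks or splits.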

Circularity Check

0 steps flagged

No significant circularity; method and claims are empirically grounded in independent pre-trained models and external benchmarks

The paper's core contribution is an empirical alignment method (MSAlign) that freezes independently pre-trained DreaMS and ChemBERTa models, adds lightweight MLP projections, and trains them with a candidate-based contrastive loss on retrieval tasks. This construction does not reduce any claimed performance gain to a fitted parameter defined by the evaluation data itself, nor does it rely on self-citation for load-bearing uniqueness theorems or ansatzes. The newly introduced distribution-shift metric is used to analyze existing splits rather than to derive the method's superiority. All reported outperformance is validated on public benchmarks (MassSpecGym, Spectraverse) with released code and splits, keeping the derivation self-contained against external data rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the two cited foundation models and the suitability of the contrastive objective on candidate sets; no new physical entities or ad-hoc constants are introduced beyond standard training hyperparameters.

axioms (1)

domain assumption: DreaMS and ChemBERTa produce representations that are alignable by simple MLPs for the retrieval task.
The method freezes these models without further training, assuming their pre-trained features are already sufficient.

pith-pipeline@v0.9.0 · 5822 in / 1321 out tokens · 53248 ms · 2026-05-20T07:43:28.877784+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: MSAlign ... aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: we formalize this tension by introducing a quantitative measure of distribution shift

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Ahmad, W., Simon, E., Chithrananda, S., Grand, G., and Ramsundar, B. (2022). ChemBERTa-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712

work page arXiv 2022
[2]

Ewald, J., Fraser, P

Alseekh, S., Aharoni, A., Brotman, Y ., Contrepois, K., D’Auria, J., Ewald, J., C. Ewald, J., Fraser, P. D., Giavalisco, P., Hall, R. D., Heinemann, M., Link, H., Luo, J., Neumann, S., Nielsen, J., Perez de Souza, L., Saito, K., Sauer, U., Schroeder, F. C., Schuster, S., Siuzdak, G., Skirycz, A., Sumner, L. W., Snyder, M. P., Tang, H., Tohge, T., Wang, Y ...

work page 2021
[3]

J., Taskar, B., and Vishwanathan, S

Bakır, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and Vishwanathan, S. V . N., editors (2007). Predicting Structured Data. MIT Press, Cambridge, MA

work page 2007
[4]

Bittremieux, W., Wang, M., and Dorrestein, P. C. (2022). The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics, 18(12):94

work page 2022
[5]

Bohde, M., Manjrekar, M., Wang, R., Ji, S., and Coley, C. W. (2025). DiffMS: diffusion generation of molecules conditioned on mass spectra. In Proceedings of the 42nd International Conference on Machine Learning, ICML’25. JMLR.org

work page 2025
[6]

Brogat-Motte, L., Flamary, R., Brouard, C., Rousu, J., and d’Alché Buc, F. (2022). Learning to predict graphs with fused gromov-wasserstein barycenters. In International Conference on Machine Learning, pages 2321–2335. PMLR

work page 2022
[7]

Brouard, C., Shen, H., Dührkop, K., d’Alché Buc, F., Böcker, S., and Rousu, J. (2016). Fast metabolite identification with input output kernel regression. Bioinformatics, 32(12):i28–i36

work page 2016
[8]

F., Young, A., Kretschmer, F., Samusevich, R., Heirman, J., Wang, F., Zhang, L., Dührkop, K., et al

Bushuiev, R., Bushuiev, A., de Jonge, N. F., Young, A., Kretschmer, F., Samusevich, R., Heirman, J., Wang, F., Zhang, L., Dührkop, K., et al. (2024). MassSpecGym: A benchmark for the discovery and identification of molecules. Advances in Neural Information Processing Systems, 37:110010–110027

work page 2024
[9]

Bushuiev, R., Bushuiev, A., Samusevich, R., Brungs, C., Sivic, J., and Pluskal, T. (2025). Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS. Nature Biotechnology, pages 1–11

work page 2025
[10]

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR

work page 2020
[11]

Z., Rushing, B., and Hassoun, S

Chen, Y . Z., Rushing, B., and Hassoun, S. (2026). FLARE: Fine-grained learning for alignment of spectra-molecule representation enhances metabolite annotation. bioRxiv, pages 2026–01

work page 2026
[12]

S., Junot, C., Tabet, J.-C., and Fenaille, F

Damont, A., Darii, E., Cao, C., Legrand, A., Perret, A., Dechaumet, S., Woods, A. S., Junot, C., Tabet, J.-C., and Fenaille, F. (2025). Exploring the fragmentation of sodiated species involving covalent-bond cleavages for metabolite characterization. Rapid Communications in Mass Spectrometry, page e10133

work page 2025
[13]

de Jonge, N., van der Hooft, J. J. J., and Probst, D. (2025). To Bin or not to Bin: Alternative Representations of Mass Spectra. 10

work page 2025
[14]

P., Laukens, K., and Cuyckens, F

De Vijlder, T., Valkenborg, D., Lemière, F., Romijn, E. P., Laukens, K., and Cuyckens, F. (2018). A tutorial in small molecule identification via electrospray ionization-mass spectrometry: The practical art of structural elucidation. Mass Spectrometry Reviews, 37(5):607–629

work page 2018
[15]

De Waele, G., Wydmuch, M., Waegeman, W., et al. (2026). Small molecule retrieval from tandem mass spectrometry: what are we optimizing for? arXiv preprint arXiv:2602.16507

work page arXiv 2026
[16]

R., Benton, H

Domingo-Almenara, X., Montenegro-Burke, J. R., Benton, H. P., and Siuzdak, G. (2018). Annotation: A Computational Solution for Streamlining Metabolomics Analysis. Analytical Chemistry, 90(1):480–489

work page 2018
[17]

A., Melnik, A

Dührkop, K., Fleischauer, M., Ludwig, M., Aksenov, A. A., Melnik, A. V ., Meusel, M., Dorrestein, P. C., Rousu, J., and Böcker, S. (2019). SIRIUS 4: A rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods, 16(4):299–302

work page 2019
[18]

A., Petras, D., Gerwick, W

Dührkop, K., Nothias, L.-F., Fleischauer, M., Reher, R., Ludwig, M., Hoffmann, M. A., Petras, D., Gerwick, W. H., Rousu, J., Dorrestein, P. C., et al. (2021). Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nature biotechnology, 39(4):462–471

work page 2021
[19]

Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112(41):12580–12585

work page 2015
[20]

El Abiead, Y ., Rutz, A., Zuffa, S., Amer, B., Xing, S., Brungs, C., Schmid, R., Correia, M. S. P., Caraballo- Rodriguez, A. M., Zarrinpar, A., Mannochio-Russo, H., Witting, M., Mohanty, I., Pluskal, T., Bittremieux, W., Knight, R., Patterson, A. D., van der Hooft, J. J. J., Böcker, S., Dunn, W. B., Linington, R. G., Wishart, D. S., Wolfender, J.-L., Fieh...

work page 2025
[21]

El Ahmad, T., Brogat-Motte, L., Laforgue, P., and d’Alché Buc, F. (2024). Sketch in, sketch out: Accelerating both learning and inference for structured prediction with kernels. In International conference on artificial intelligence and statistics, pages 109–117. PMLR

work page 2024
[22]

Elser, D., Huber, F., and Gaquerel, E. (2023). Mass2SMILES: deep learning based fast prediction of structures and functional groups directly from high-resolution ms/ms spectra. bioRxiv, pages 2023–07

work page 2023
[23]

Fan, Z., Alley, A., Ghaffari, K., and Ressom, H. W. (2020). MetFID: Artificial neural network-based compound fingerprint prediction for metabolite annotation. Metabolomics, 16(10):104

work page 2020
[24]

Farahani, A., V oghoei, S., Rasheed, K., and Arabnia, H. R. (2021). A brief review of domain adaptation. Advances in data science and information engineering: proceedings from ICDATA2020 and IKE 2020, pages 877–894

work page 2021
[25]

Q., David, L., Bonet, C., Cassereau, N., Gnassounou, T., et al

Flamary, R., Vincent-Cuaz, C., Courty, N., Gramfort, A., Kachaiev, O., Tran, H. Q., David, L., Bonet, C., Cassereau, N., Gnassounou, T., et al. (2024). Pot python optimal transport (version 0.9. 5), 2024. URL https://github. com/PythonOT/POT, 10

work page 2024
[26]

J., and Coley, C

Goldman, S., Wohlwend, J., Stražar, M., Haroush, G., Xavier, R. J., and Coley, C. W. (2023a). Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nature Machine Intelligence, 5(9):965–979

work page
[27]

Goldman, S., Xin, J., Provenzano, J., and Coley, C. W. (2023b). MIST-CF: Chemical formula inference from tandem mass spectra. Journal of Chemical Information and Modeling, 64(7):2421–2431

work page
[28]

Gupta, V ., Qiang, H., Chung, H.-H., Herbst, E., and Skinnider, M. A. (2026). Comprehensive curation and harmonization of small-molecule MS/MS libraries in Spectraverse. Analytical Chemistry, 98(5):3934–3943

work page 2026
[29]

Han, Y ., Wang, P., Yu, K., Chen, X., and Chen, L. (2025). MS-BART: Unified modeling of mass spectra and molecules for structure elucidation. arXiv preprint arXiv:2510.20615

work page arXiv 2025
[30]

Heinonen, M., Shen, H., Zamboni, N., and Rousu, J. (2012). Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics, 28(18):2333–2341

work page 2012
[31]

and Bittremieux, W

Heirman, J. and Bittremieux, W. (2024). Reusability report: annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nature Machine Intelligence, 6(11):1296–1302

work page 2024
[32]

Hong, Y ., Li, S., Ye, Y ., and Tang, H. (2025). FIDDLE: a deep learning method for chemical formulas prediction from tandem mass spectra. Nature Communications, 16(1):11102

work page 2025
[33]

Huh, M., Cheung, B., Wang, T., and Isola, P. (2024). Position: The platonic representation hypothesis. In ICML, pages 20617–20642

work page 2024
[34]

Ji, H., Deng, H., Lu, H., and Zhang, Z. (2020). Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural networks. Analytical chemistry, 92(13):8649–8653

work page 2020
[35]

Ji, X., Wang, Z., Gao, Z., Zheng, H., Zhang, L., Ke, G., et al. (2024). Uni-Mol2: Exploring molecular pretraining model at scale. arXiv preprint arXiv:2406.14969. 11

work page arXiv 2024
[36]

Kalia, A., Zhou Chen, Y ., Krishnan, D., and Hassoun, S. (2025). JESTR: Joint embedding space tech- nique for ranking candidate molecules for the annotation of untargeted metabolomics data. Bioinformatics, 41(7):btaf354

work page 2025
[37]

A., Thiessen, P

Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B. A., Thiessen, P. A., Yu, B., et al. (2023). PubChem 2023 update. Nucleic acids research, 51(D1):D1373–D1380

work page 2023
[38]

S., Wohlgemuth, G., Barupal, D

Kind, T., Tsugawa, H., Cajka, T., Ma, Y ., Lai, Z., Mehta, S. S., Wohlgemuth, G., Barupal, D. K., Showalter, M. R., Arita, M., and Fiehn, O. (2018). Identification of small molecules using accurate mass MS/MS search. Mass Spectrometry Reviews, 37(4):513–532

work page 2018
[39]

Krzakala, P., Melo, G., Laclau, C., d’Alché Buc, F., and Flamary, R. (2025). The quest for the graph level autoencoder (GRALE). arXiv preprint arXiv:2505.22109

work page arXiv 2025
[40]

Kudriavtseva, P., Kashkinov, M., and Kertész-Farkas, A. (2021). Deep convolutional neural networks help scoring tandem mass spectrometry data in database-searching approaches. Journal of proteome research, 20(10):4708–4717

work page 2021
[41]

Landrum, G. et al. (2013). Rdkit documentation. Release, 1(1-79):4

work page 2013
[42]

LeCun, Y ., Chopra, S., Hadsell, R., Ranzato, M., Huang, F., et al. (2006). A tutorial on energy-based learning. Predicting structured data, 1(0)

work page 2006
[43]

Litsa, E., Chenthamarakshan, V ., Das, P., and Kavraki, L. (2021). Spec2Mol: An end-to-end deep learning framework for translating ms/ms spectra to de-novo molecules. ChemRxiv

work page 2021
[44]

D., Dorrestein, P

Ludwig, M., Broeckling, C. D., Dorrestein, P. C., Dührkop, K., Schymanski, E. L., Böcker, S., and Nothias, L.-F. (2020). Studying charge migration fragmentation of sodiated precursor ions in collision-induced dissociation at the library scale. Journal of the American Society for Mass Spectrometry, 32(1):180–186

work page 2020
[45]

Méndez-Lucio, O., Nicolaou, C., and Earnshaw, B. (2022). MolE: a molecular foundation model for drug discovery. arXiv preprint arXiv:2211.02657

work page arXiv 2022
[46]

H., Nguyen, C

Nguyen, D. H., Nguyen, C. H., and Mamitsuka, H. (2019). Recent advances and prospects of computational methods for metabolite identification: A review with emphasis on machine learning approaches. Briefings in Bioinformatics, 20(6):2028–2043

work page 2019
[47]

and Lampert, C

Nowozin, S. and Lampert, C. H. (2011). Structured prediction and learning in computer vision.Foundations and Trends in Computer Graphics and Vision, 6(3-4):3–4

work page 2011
[48]

and Cuturi, M

Peyré, G. and Cuturi, M. (2019). Computational optimal transport with applications to data sciences. Foundations and Trends® in Machine Learning, 11(5-6):355–607

work page 2019
[49]

Pollmann, J., Bushuiev, R., Bushuiev, A., Pluskal, T., and Huber, F. (2026). Bridging ms2 spectra and chemical space: Advances in spectral similarity, molecular retrieval, and de novo structure discovery. chemrxiv.15000536

work page 2026
[50]

A., Melnik, A

Quinn, R. A., Melnik, A. V ., Vrbanac, A., Fu, T., Patras, K. A., Christy, M. P., Bodai, Z., Belda-Ferre, P., Tripathi, A., Chung, L. K., Downes, M., Welch, R. D., Quinn, M., Humphrey, G., Panitchpakdi, M., Weldon, K. C., Aksenov, A., da Silva, R., Avila-Pacheco, J., Clish, C., Bae, S., Mallick, H., Franzosa, E. A., Lloyd-Price, J., Bussell, R., Thron, T....

work page 2020
[51]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763

work page 2021
[52]

Rakhshaninejad, M., De Waele, G., Jürgens, M., and Waegeman, W. (2026). Reliable molecular retrieval from mass spectra using conformal prediction. bioRxiv, pages 2026–03

work page 2026
[53]

Robinson, J., Chuang, C.-Y ., Sra, S., and Jegelka, S. (2020). Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592

work page arXiv 2020
[54]

Rong, Y ., Bian, Y ., Xu, T., Xie, W., Wei, Y ., Huang, W., and Huang, J. (2020). Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33:12559– 12571

work page 2020
[55]

Roschmann, S., Krzakala, P., Mazelet, S., Bouniot, Q., and Akata, Z. (2026). SOTAlign: Semi-supervised alignment of unimodal vision and language models via optimal transport. arXiv preprint arXiv:2602.23353

work page arXiv 2026
[56]

F., Nowatzky, Y ., Jaeger, C., Parr, M

Russo, F. F., Nowatzky, Y ., Jaeger, C., Parr, M. K., Benner, P., Muth, T., and Lisec, J. (2024). Machine learning methods for compound annotation in non-targeted mass spectrometry—A brief overview of fin- gerprinting, in silico fragmentation and de novo methods. Rapid Communications in Mass Spectrometry, 38(20):e9876. 12

work page 2024
[57]

L., Jeon, J., Gulde, R., Fenner, K., Ruff, M., Singer, H

Schymanski, E. L., Jeon, J., Gulde, R., Fenner, K., Ruff, M., Singer, H. P., and Hollender, J. (2014). Identi- fying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence. Environmental Science & Technology, 48(4):2097–2098

[58]

Stravs, M. A., Dührkop, K., Böcker, S., and Zamboni, N. (2022). MSNovelist: de novo structure generation from mass spectra. Nature Methods, 19(7):865–870

[59]

Thirukovalluru, R., Meng, R., Liu, Y., Su, M., Nie, P., Yavuz, S., Zhou, Y., Chen, W., Dhingra, B., et al. (2025). Breaking the batch barrier (B3) of contrastive learning via smart batch mining. arXiv preprint arXiv:2505.11293

[60]

Vaniya, A. and Fiehn, O. (2022). Revisiting CASMI: Compound ID for 500 new unknowns, using LC-MS/MS data

[61]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30

[62]

Vouitsis, N., Liu, Z., Gorti, S. K., Villecroze, V., Cresswell, J. C., Yu, G., Loaiza-Ganem, G., and Volkovs, M. (2024). Data-efficient multimodal fusion on a single GPU. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27239–27251

[63]

Wang, T. and Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR

[64]

Wang, Y., Chen, X., Liu, L., and Hassoun, S. (2025). MADGEN: Mass-spec attends to de novo molecular generation. arXiv preprint arXiv:2501.01950

[65]

Wishart, D. S. (2019). Metabolomics for investigating physiological and pathophysiological processes. Physiological Reviews, 99(4):1819–1875

[66]

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530

[67]

Xing, S., Shen, S., Xu, B., Li, X., and Huan, T. (2023). BUDDY: Molecular formula discovery via bottom-up MS/MS interrogation. Nature Methods, 20(6):881–890

[68]

Xu, R. and Zhu, J. (2025). Unveiling the dark matter of the metabolome: A narrative review of bioinformatics tools for LC-HRMS-based compound annotation. Talanta, 295:128327

[69]

expert” models, which can be selected at inference time when the adduct is known. In Table 9 we report the results of these different strategies. In the

Zhang, L., Yang, Q., and Agrawal, A. (2025). Assessing and learning alignment of unimodal vision and language models. In CVPR, pages 14604–14614.

A Implementation details

All SMILES strings are canonicalized and sanitized using RDKit [41]. Chemical formulas and weights are computed by explicitly accounting for implicit hydrogen atoms. The molecular ma...

[70]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...


