CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

Chixiang Lu; Haibo Jiang; Hengyu Zhang; Jing Hao; Lifei Wang; Tianbo Liu; Xiaojuan Qi

arxiv: 2605.12980 · v1 · pith:U2NLTI3Vnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Tianbo Liu, Chixiang Lu, Jing Hao, Hengyu Zhang, Lifei Wang, Haibo Jiang, Xiaojuan Qi

Pith reviewed 2026-05-14 20:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords spectrum-to-structure generation · tandem mass spectra · fingerprint corruption · de novo molecular generation · autoregressive decoding · SELFIES representation · chemical constraints · molecular structure elucidation

The pith

CoRe-Gen generates molecular structures from mass spectra by training decoders on frequency-aware corrupted fingerprints to match real prediction noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoRe-Gen to improve de novo molecular structure generation from tandem mass spectra under imperfect fingerprint conditions. It identifies a mismatch where training uses clean fingerprints but deployment uses noisy predictions from spectrum-to-fingerprint models, leading to errors especially in long-tail substructures. To address this, CoRe-Gen pretrains the encoder on synthetic spectra, applies frequency-aware fingerprint corruption during decoder training, and uses structure-aware autoregressive decoding with compositional SELFIES, auxiliary supervision, and chemical constraints. This approach establishes new state-of-the-art exact-match accuracies on the NPLIB1 benchmark while staying competitive on MassSpecGym. The method keeps the efficiency of autoregressive decoding for practical use in spectrum-to-structure tasks.

Core claim

CoRe-Gen closes the condition mismatch in spectrum-to-structure pipelines by pretraining the spectrum encoder on synthetic spectra, training the decoder with frequency-aware fingerprint corruption to simulate prediction noise, and applying structure-aware autoregressive decoding using compositional SELFIES representations with auxiliary structural supervision and lightweight chemical constraints. This produces higher top-1 and top-10 exact-match accuracies on standard benchmarks.

What carries the argument

Frequency-aware fingerprint corruption applied during decoder training to reproduce the structured errors from real spectrum-to-fingerprint predictors.
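The idea can be illustrated with a minimal sketch. This summary does not give the paper's actual corruption schedule, so everything here is an assumption made for illustration: the function name `corrupt_fingerprint`, the parameters `max_drop` and `false_pos_rate`, and the rule that rare active bits are dropped more often while common inactive bits are switched on more often.

```python
import random
from typing import List, Sequence


def corrupt_fingerprint(
    fp: Sequence[int],
    bit_freq: Sequence[float],
    rng: random.Random,
    max_drop: float = 0.5,         # hypothetical cap on the drop probability
    false_pos_rate: float = 0.01,  # hypothetical scale for spurious bits
) -> List[int]:
    """Return a noisy copy of a binary fingerprint.

    Assumed rule (not taken from the paper):
      * an active bit is dropped with probability max_drop * (1 - freq),
        so long-tail (rare) bits are lost more often;
      * an inactive bit is switched on with probability false_pos_rate * freq,
        so frequent bits are hallucinated more often.
    bit_freq[i] is the fraction of corpus molecules in which bit i is active.
    """
    if len(fp) != len(bit_freq):
        raise ValueError("fp and bit_freq must have the same length")
    noisy = []
    for bit, freq in zip(fp, bit_freq):
        if bit:
            p_drop = max_drop * (1.0 - freq)
            noisy.append(0 if rng.random() < p_drop else 1)
        else:
            p_add = false_pos_rate * freq
            noisy.append(1 if rng.random() < p_add else 0)
    return noisy


# Usage: corrupt a toy 4-bit fingerprint with a seeded generator.
rng = random.Random(0)
noisy_fp = corrupt_fingerprint([1, 0, 1, 0], [0.9, 0.9, 0.05, 0.05], rng)
```

Calling the function afresh at each training step would give the decoder a different noisy view of the same molecule every time; whether the paper resamples per step is not stated in this summary.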

If this is right

Enables reliable de novo generation beyond database coverage by handling noisy inputs.
Preserves the efficiency advantages of autoregressive decoding for scalable applications.
Improves performance particularly on long-tail substructures that are prone to prediction errors.
Allows integration with large-scale molecular corpora for fingerprint-to-structure decoding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This noise-modeling technique could be adapted to other generative tasks with imperfect intermediate predictions, such as in natural language processing pipelines.
Exploring predictor-specific corruption patterns, rather than purely frequency-based ones, might further close the train-test gap.
The compositional SELFIES approach opens doors to incorporating more domain knowledge into autoregressive molecular generators.

Load-bearing premise

The frequency-aware corruption accurately reproduces the structured errors that arise from real spectrum-to-fingerprint predictors at deployment.

What would settle it

Evaluating the model using fingerprints actually predicted by a spectrum-to-fingerprint model on real spectra and comparing the accuracy to results obtained with the synthetic corruption method.

Figures

Figures reproduced from arXiv: 2605.12980 by Chixiang Lu, Haibo Jiang, Hengyu Zhang, Jing Hao, Lifei Wang, Tianbo Liu, Xiaojuan Qi.

**Figure 2.** Overview of CoRe-Gen: improving, matching, and exploiting imperfect fingerprint

**Figure 3.** Long-tail distribution of active finger

**Figure 4.** Representative successful generation examples of CoRe-Gen.

Abstract

Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen that explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CoRe-Gen gives a concrete three-part recipe to close the clean-to-noisy fingerprint gap in spectrum-to-structure generation and shows measurable gains on NPLIB1.

The paper's main contribution is a targeted fix for the deployment mismatch: models trained on perfect fingerprints have to run on noisy predictions from real spectrum-to-fingerprint pipelines. CoRe-Gen handles this with synthetic-spectrum pretraining of the encoder, frequency-aware corruption during decoder training, and compositional SELFIES autoregressive decoding plus auxiliary structural supervision and light chemical constraints. That combination is new enough to stand out from earlier two-stage pipelines that did not explicitly model the noise at test time. The reported numbers (19.54% top-1 and 29.92% top-10 exact match on NPLIB1, while staying competitive on the harder MassSpecGym set) look like a practical step forward while preserving autoregressive speed. The work is empirical and uses external benchmarks, so there is no obvious circularity in the claims.

The soft spot is the central assumption that frequency-aware corruption reproduces the structured, substructure-biased errors that real predictors actually make. The abstract gives no quantitative check such as error histograms or divergence between synthetic and real residuals, which leaves open the chance that some of the lift comes from benchmark-specific tuning rather than broad robustness. If the full paper supplies that validation, the case strengthens; otherwise it is a clear point for reviewers to probe.

This is aimed at people working on de novo structure elucidation from MS/MS data in metabolomics and natural-product chemistry. A reader who needs a ready-to-try recipe for noisy fingerprint conditions would get direct value from the details. I would send it to peer review: the idea is grounded, the benchmarks are standard, and the gap it addresses is real even if the noise-model validation needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CoRe-Gen for de novo molecular structure elucidation from tandem mass spectra. It decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding and addresses the train-test condition mismatch through synthetic-spectrum pretraining of the encoder, frequency-aware fingerprint corruption during decoder training, and structure-aware autoregressive decoding that incorporates compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. The work reports new state-of-the-art exact-match accuracies on the NPLIB1 benchmark (19.54% Top-1, 29.92% Top-10) while remaining competitive on the more challenging MassSpecGym benchmark.

Significance. If the robustness gains are shown to arise from genuine mismatch mitigation rather than an optimistic noise model, the approach would constitute a practical advance for scalable spectrum-to-structure generation. The preservation of autoregressive decoding efficiency together with explicit handling of long-tail substructures and auxiliary supervision are concrete strengths that could influence downstream metabolomics applications.

major comments (2)

[Methods (fingerprint corruption and decoder training)] The central claim that frequency-aware fingerprint corruption during decoder training accurately reproduces the structured, substructure-biased errors of real spectrum-to-fingerprint predictors is load-bearing for the robustness argument, yet the manuscript supplies no quantitative validation (per-substructure error-rate histograms, KL divergence between synthetic and real residuals, or similar) comparing the synthetic corruption schedule to actual predictor error distributions. Without this check, it remains possible that the reported NPLIB1 gains are driven by an optimistic noise model rather than genuine robustness.
[§4 Experiments] The abstract states clear numerical improvements on standard benchmarks, but the manuscript must provide explicit controls for data leakage, confirmation that the corruption schedule was not tuned post-hoc on the test distribution, and full baseline implementation details to substantiate the SOTA claim on NPLIB1.
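The validation requested in the first major comment could look roughly like the following sketch. It is not the authors' analysis: the helper names `per_bit_error_rate` and `kl_divergence` are hypothetical, and so is the choice to normalise per-bit error rates into a distribution over bit positions before comparing them. The report itself does not say over which quantity the divergence should be taken.

```python
import math
from typing import List, Sequence


def per_bit_error_rate(
    truth: Sequence[Sequence[int]], noisy: Sequence[Sequence[int]]
) -> List[float]:
    """Fraction of molecules for which each bit differs from the ground truth."""
    if not truth or len(truth) != len(noisy):
        raise ValueError("need two non-empty fingerprint lists of equal length")
    n_bits = len(truth[0])
    errors = [0] * n_bits
    for t_row, n_row in zip(truth, noisy):
        for i in range(n_bits):
            if t_row[i] != n_row[i]:
                errors[i] += 1
    return [e / len(truth) for e in errors]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """KL(P || Q) after additive smoothing and normalisation to sum to one.

    Assumption: p and q are non-negative per-bit error rates, which are
    rescaled here into distributions describing where the errors concentrate.
    """
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


# Usage: compare errors of a real predictor with errors of synthetic corruption.
truth = [[1, 0, 1], [1, 1, 0]]
predicted = [[1, 0, 0], [1, 1, 1]]   # toy stand-in for predictor output
corrupted = [[1, 1, 1], [1, 1, 0]]   # toy stand-in for synthetic corruption
gap = kl_divergence(
    per_bit_error_rate(truth, predicted),
    per_bit_error_rate(truth, corrupted),
)
```

A value near zero would indicate that the synthetic corruption places its errors on the same bits as the real predictor; a large value would support the referee's concern about an optimistic noise model.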

minor comments (2)

[Methods] Notation for the compositional SELFIES representation and the auxiliary supervision losses should be introduced with explicit equations or pseudocode in the methods section for reproducibility.
[Figures and Tables] Table captions and axis labels in the benchmark result figures should explicitly state whether reported accuracies are exact-match or relaxed-match to avoid reader confusion.
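To make the exact-match criterion in the second minor comment concrete, here is a minimal sketch of a top-k exact-match score. It assumes each candidate and each target has already been reduced to a canonical identifier string upstream (for example an InChIKey); the paper's own matching rule is not specified in this summary, and the function name `top_k_exact_match` is hypothetical.

```python
from typing import Sequence


def top_k_exact_match(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of spectra whose target identifier is among the first k candidates.

    Assumption: identifiers are canonical, so string equality means the same
    structure. A relaxed-match metric would need a similarity threshold instead.
    """
    if not targets or len(ranked_predictions) != len(targets):
        raise ValueError("need one ranked candidate list per target")
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Usage with placeholder identifiers (not real molecules).
preds = [["A", "B", "C"], ["X", "Y", "Z"], ["M", "N", "O"]]
targets = ["A", "Z", "Q"]
top1 = top_k_exact_match(preds, targets, k=1)
top3 = top_k_exact_match(preds, targets, k=3)
```

In this toy case one of three targets is ranked first and two of three appear within the first three candidates, which is the kind of gap the Top-1 and Top-10 figures report.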

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We agree that quantitative validation of the corruption model and additional experimental controls are needed to fully support the robustness claims. We will revise the manuscript to incorporate these elements while preserving the core contributions.

Referee: [Methods (fingerprint corruption and decoder training)] The central claim that frequency-aware fingerprint corruption during decoder training accurately reproduces the structured, substructure-biased errors of real spectrum-to-fingerprint predictors is load-bearing for the robustness argument, yet the manuscript supplies no quantitative validation (per-substructure error-rate histograms, KL divergence between synthetic and real residuals, or similar) comparing the synthetic corruption schedule to actual predictor error distributions. Without this check, it remains possible that the reported NPLIB1 gains are driven by an optimistic noise model rather than genuine robustness.

Authors: We acknowledge that the current manuscript does not include direct quantitative comparisons between the frequency-aware corruption schedule and real predictor error distributions. In the revision we will add per-substructure error-rate histograms, KL-divergence measurements, and residual distribution plots computed on held-out validation spectra. These analyses will be performed using the same spectrum-to-fingerprint model that generates the noisy fingerprints at inference time, thereby demonstrating that the synthetic corruption reproduces the observed structured, substructure-biased errors rather than relying on an optimistic noise model. revision: yes
Referee: [§4 Experiments] The abstract states clear numerical improvements on standard benchmarks, but the manuscript must provide explicit controls for data leakage, confirmation that the corruption schedule was not tuned post-hoc on the test distribution, and full baseline implementation details to substantiate the SOTA claim on NPLIB1.

Authors: We will expand §4 with (i) explicit statements and verification that training, validation, and test molecule sets are disjoint at the InChI level, (ii) a clear description that the corruption schedule hyperparameters were selected exclusively on the validation split with no access to test data, and (iii) complete baseline implementation details including model architectures, training hyperparameters, and public code references. These additions will be placed in a new subsection on experimental rigor and will be accompanied by supplementary tables listing all hyperparameter choices. revision: yes
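The disjointness check promised in item (i) of the second response can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' verification script: the function name `split_overlaps` is hypothetical, and the identifiers are assumed to be InChI strings computed upstream by a cheminformatics toolkit.

```python
from typing import Dict, Iterable, Set, Tuple


def split_overlaps(
    splits: Dict[str, Iterable[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return identifiers shared by each pair of splits.

    An empty result means the splits are pairwise disjoint. Split names are
    compared in sorted order, so each pair appears once.
    """
    as_sets = {name: set(ids) for name, ids in splits.items()}
    names = sorted(as_sets)
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = as_sets[first] & as_sets[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Usage with placeholder identifiers standing in for InChI strings.
leaky = {"train": ["K1", "K2"], "val": ["K3"], "test": ["K2", "K4"]}
report = split_overlaps(leaky)  # flags "K2" as shared by test and train
```

Reporting the shared identifiers, rather than a bare pass or fail, makes it easier to trace a leak back to the spectra that caused it.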

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline with external benchmarks

The paper describes an empirical architecture (synthetic pretraining, frequency-aware corruption, autoregressive decoding with SELFIES and auxiliary supervision) evaluated on external benchmarks NPLIB1 and MassSpecGym. No equations, derivations, or fitted parameters are presented that reduce the reported accuracies to quantities defined by the same inputs. The central claims rest on experimental results rather than self-referential definitions or self-citation chains that would force the outcome by construction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard supervised learning assumptions plus domain assumptions about SELFIES validity and the statistical distribution of fingerprint errors; no new physical entities or ad-hoc constants are introduced.

axioms (2)

domain assumption: Standard neural network training converges to a useful mapping from spectra to fingerprints and from fingerprints to structures.
Invoked implicitly throughout the training procedure described in the abstract.
domain assumption: Frequency statistics of substructures in the training corpus are representative of the error patterns that real predictors will produce at test time.
Basis for the frequency-aware corruption step.
