SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception

Alexandra Gutmann; Antônio H. Ribeiro; Danica Kragic; Eunyeong Jin; Farzaneh Taleb; Jonathan Williams; Miguel Vasco; Nona Rajabi; Ziqi Zhang

arxiv: 2605.27009 · v1 · pith:RAHZQX22new · submitted 2026-05-26 · 💻 cs.LG

Pith reviewed 2026-06-29 19:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords mass spectrometry, odor prediction, contrastive learning, multi-modal alignment, olfactory perception, chemical embeddings, EI-MS, odor descriptors

The pith

Aligning mass spectra with chemical structure embeddings enables odor prediction from spectra alone at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a contrastive learning approach can transfer perceptual information from molecular structures to mass spectra during training. This matters because practical sensing devices produce spectra but rarely have explicit chemical structures available. The resulting model outperforms spectrum-only baselines and reaches performance levels similar to structure-based methods on multi-label odor descriptor tasks. It also produces representations that track continuous human perceptual ratings more closely and works on real lab spectra.

Core claim

SCENT uses multi-modal contrastive learning to align electron ionization mass spectrometry representations with pretrained chemical structure embeddings, so that only mass spectra are required at inference while still supporting accurate prediction of multi-label odor descriptors at levels comparable to models that receive explicit structure input.

What carries the argument

Spectrum-to-Chemical Embedding alignmeNT (SCENT), a contrastive learning framework that pulls mass-spectra representations toward pretrained structure embeddings and pushes unrelated pairs apart.
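
The pull-together, push-apart behaviour described above can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch, not the paper's Eq. 1, which is not reproduced on this page. The function name `symmetric_info_nce`, the cosine-similarity scoring, and the temperature value of 0.07 are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(ms_emb, struct_emb, temperature=0.07):
    """Generic contrastive loss; row i of both inputs is assumed to be the same molecule."""
    ms = l2_normalize(ms_emb)
    st = l2_normalize(struct_emb)
    logits = ms @ st.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(logits))
    # Matched pairs sit on the diagonal; every other entry acts as a negative.
    loss_ms_to_st = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_st_to_ms = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ms_to_st + loss_st_to_ms)

# Synthetic stand-ins for encoder outputs (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
struct = rng.normal(size=(8, 16))
aligned_ms = struct + 0.01 * rng.normal(size=(8, 16))   # nearly aligned spectra embeddings
random_ms = rng.normal(size=(8, 16))                    # unrelated spectra embeddings

loss_aligned = symmetric_info_nce(aligned_ms, struct)
loss_random = symmetric_info_nce(random_ms, struct)
```

In a setup like the one the page describes, only the spectrum encoder would receive gradients from this loss, since the structure encoder is frozen.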

If this is right

The spectrum-only model beats standard MS-only baselines on multi-label odor descriptor prediction.
Performance reaches levels comparable to models that receive explicit molecular structure at test time.
The learned representations more closely match continuous human perceptual ratings than baselines.
The approach generalizes to real-world laboratory-measured mass spectra.
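
The figure captions on this page report odor descriptor results as mean per-label ROC-AUC. As a rough illustration of that metric, the sketch below computes AUC per label from its rank interpretation and averages across labels. The helper names and the toy labels are assumptions; the paper's exact averaging or weighting scheme is not shown here.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when a label has only one class present
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def mean_per_label_auc(Y_true, Y_score):
    # One AUC per descriptor column, then an unweighted mean that skips undefined labels.
    aucs = [roc_auc(Y_true[:, j], Y_score[:, j]) for j in range(Y_true.shape[1])]
    return float(np.nanmean(aucs)), aucs

# Toy multi-label example: 4 molecules, 2 hypothetical odor descriptors.
Y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]])
mean_auc, per_label = mean_per_label_auc(Y_true, Y_score)
```

The pairwise formulation is quadratic in the number of samples per label, which is fine for a sketch but would normally be replaced by a sort-based implementation on larger datasets.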

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Portable or field-deployable sensors could use this alignment to estimate odor properties without needing chemical structure databases at runtime.
The same alignment strategy might transfer to other analytical signals such as infrared spectra or chromatography data for related perceptual or functional predictions.
If the structure embeddings already encode perceptual semantics, the method effectively distills that knowledge into a cheaper input modality.

Load-bearing premise

The contrastive alignment successfully moves perceptual semantic information into the spectrum encoder so that spectra alone become sufficient for accurate odor prediction.

What would settle it

Train the alignment, then test on a held-out set: the claim fails if the spectrum-only model performs no better than an untrained spectrum baseline on odor descriptor prediction.

Figures

Figures reproduced from arXiv: 2605.27009 by Alexandra Gutmann, Antônio H. Ribeiro, Danica Kragic, Eunyeong Jin, Farzaneh Taleb, Jonathan Williams, Miguel Vasco, Nona Rajabi, Ziqi Zhang.

**Figure 1.** Overview of SCENT. SCENT contrastively aligns EI-MS embeddings with molecular structure embeddings, which are slow and expensive to obtain. At inference, only mass spectra are required. The learned representations are evaluated on three downstream tasks: odor descriptor prediction, human perceptual rating regression, and real-world lab-measured spectra. electron ionization mass spectrometry with a quadrupo…

**Figure 2.** The SCENT framework: (A) A learnable MS encoder and a frozen chemical structure encoder project both modalities into a shared latent space, trained with a multi-modal contrastive loss (Eq. 1); (B) The frozen, aligned MS encoder is evaluated on two downstream tasks: (i) multilabel odor descriptor prediction using a trainable MLP classification head (GS-LF dataset), and (ii) perceptual rating regression usi…

**Figure 3.** Mean per-label ROC-AUC grouped by label frequency (MolFormer-based models). Each bar reports the mean AUC across labels within a given frequency bin (error bars indicate standard error of the mean). SCENT (MolFormer) consistently outperforms EIMS2Vec across all frequency bins, with statistically significant gains even in low-frequency categories. Statistical significance is assessed via the Wilcoxon signed…

**Figure 4.** Test performance vs. training set fraction. Test loss (left) and weighted-AUC (right) are reported for SCENT (OpenPOM), SCENT (MolFormer), and EIMS2Vec across varying fractions of the training set (0.1 to 1.0), as mean ± std over 5 folds. SCENT consistently outperforms EIMS2Vec across all fractions. More results in Appendix E.3.

**Figure 5.** Pearson correlation between predicted and human-rated perceptual scores on the DREAM dataset (MolFormer family). Ridge regression is fitted on frozen embeddings to predict 21 continuous perceptual descriptors (mean ± std across 100 folds). SCENT (MolFormer) significantly outperforms the unaligned baseline and achieves performance comparable to MolFormer despite requiring no molecular structure at inference…

**Figure 6.** Signal pre-processing pipeline for lab-collected spectra. (A) Raw ion intensity is collected as a 2D matrix over m/z channels and time. (B) A sliding-slope algorithm segments the total ion current into background and sample regions. (C) Per-channel baseline correction uses a noise threshold of $\mu_{\text{bg}} + 5\sigma_{\text{bg}}$ estimated from the background region. (D) Corrected intensities within the sample window are averaged o…
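
Steps (C) and (D) of the Figure 6 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the background and sample windows are taken as given rather than found by the sliding-slope segmentation of step (B), and the choice to subtract the background mean and zero out sub-threshold values is a guess at what "baseline correction" means here. The function name `baseline_correct` and the synthetic data are hypothetical.

```python
import numpy as np

def baseline_correct(intensity, bg_slice, sample_slice, k=5.0):
    """intensity: (time, mz) array. Returns one averaged spectrum over m/z channels."""
    bg = intensity[bg_slice]
    mu_bg = bg.mean(axis=0)                 # per-channel background mean
    sigma_bg = bg.std(axis=0)               # per-channel background standard deviation
    threshold = mu_bg + k * sigma_bg        # noise threshold, mu_bg + 5 * sigma_bg by default
    sample = intensity[sample_slice]
    # Assumption: readings above threshold keep their excess over the background mean,
    # and everything at or below threshold is treated as noise and set to zero.
    corrected = np.where(sample > threshold, sample - mu_bg, 0.0)
    return corrected.mean(axis=0)           # average over the sample window

# Synthetic acquisition: 60 time steps, 5 m/z channels, signal in channel 2 only.
rng = np.random.default_rng(0)
intensity = 10.0 + rng.normal(scale=0.5, size=(60, 5))
intensity[30:50, 2] += 100.0
spectrum = baseline_correct(intensity, slice(0, 20), slice(30, 50))
```

Thresholding per channel rather than on the total ion current keeps a noisy channel from masking a quiet one, which is presumably why the caption describes the correction as per-channel.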

Original abstract

Predicting human olfactory perception from molecular structure has seen remarkable progress, yet these approaches require explicit chemical structure at inference, which is not available in practical sensing settings. We address this gap by exploring direct electron ionization mass spectrometry (EI-MS), a sensing technique that acquires chemically informative fragmentation fingerprints in seconds, as an alternative input modality for olfactory prediction. We contribute Spectrum-to-Chemical Embedding alignmeNT (SCENT), a multi-modal contrastive learning framework that aligns EI-MS representations with pretrained chemical structure embeddings, while requiring only mass spectra at inference. On the multi-label odor descriptor prediction task, SCENT significantly outperforms MS-only baselines and achieves performance comparable to structure-based models, despite requiring no explicit molecular structure at test time. The learned representations also better approximate continuous human perceptual ratings and generalize to real-world lab-measured spectra, suggesting that cross-modal alignment is an effective strategy for grounding analytical spectra in chemical semantics.
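
The perceptual-rating evaluation is described in the Figure 5 caption as ridge regression fitted on frozen embeddings and scored by Pearson correlation. A minimal sketch of such a linear probe is shown below. The closed-form solver, the regularization strength `alpha=1.0`, the single train/test split, and the synthetic embeddings and ratings are all assumptions for illustration; the paper's fold structure and hyperparameters are not given on this page.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def pearson_r(a, b):
    # Pearson correlation between two 1D arrays.
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic stand-ins: 200 "frozen embeddings" and one continuous "rating".
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
rating = emb @ true_w + 0.1 * rng.normal(size=200)

train, test = slice(0, 150), slice(150, 200)
w, b = fit_ridge(emb[train], rating[train])
r = pearson_r(emb[test] @ w + b, rating[test])
```

Because the embeddings stay frozen and the probe is linear, a high correlation reflects information already present in the representation rather than capacity added by the probe.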

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCENT proposes contrastive alignment of mass spectra to structure embeddings for odor prediction from spectra alone, but the abstract gives no metrics or controls to check if perceptual semantics actually transfer.

The main point is that this paper identifies the practical limit of structure-based odor models (you need the structure at test time) and tries to fix it by aligning EI-MS spectra to pretrained structure embeddings via contrastive learning. At inference you only need the spectrum. That framing is reasonable and targets a real use case for analytical devices.

What the work does is apply a standard multi-modal contrastive setup to this specific pair of modalities and task. The claim is that the resulting MS representations beat pure MS baselines and reach parity with structure models on multi-label odor descriptor prediction, plus some generalization to lab spectra and better match to continuous ratings.

The abstract states these outcomes but supplies none of the supporting numbers, dataset sizes, baseline descriptions, or training details. Without those it is not possible to assess whether the gains come from the alignment transferring odor-relevant information or from something simpler like improved feature extraction on the spectra side.

The stress-test concern lands: the structure embeddings are pretrained without odor supervision, so the contrastive objective might align on generic chemical or fragmentation patterns rather than perceptual axes. The abstract reports no retrieval metrics, no ablation of the contrastive term, and no check on whether odor-predictive dimensions are preserved in the aligned space. That leaves the central transfer claim unverified from what is shown.

This is for people working on multi-modal methods in chemistry or sensory ML who want to see modality transfer tried on olfaction. A reader already following contrastive alignment papers might skim the full version for the experimental controls, but the current writeup is too thin to cite or build on.

I would send it to peer review. The problem is clearly stated and the method is a direct attempt to address it; if the full paper contains the missing quantitative evidence and ablations, it could be worth referee time even if revisions are needed.

Referee Report

2 major / 1 minor

Summary. The paper introduces SCENT, a contrastive learning framework that aligns EI-MS spectra representations with pretrained molecular structure embeddings. At inference, only mass spectra are required for multi-label odor descriptor prediction. The central claim is that SCENT significantly outperforms MS-only baselines while matching the performance of structure-based models, with additional benefits in approximating continuous perceptual ratings and generalizing to real-world spectra.

Significance. If the alignment successfully transfers odor-relevant perceptual semantics, the work would enable practical sensing applications where molecular structures are unavailable, using rapid EI-MS acquisition. The approach demonstrates a concrete use of cross-modal contrastive learning to ground analytical data in chemical semantics without requiring structure at test time.

major comments (2)

[Abstract] The claim that contrastive alignment transfers perceptual semantic information such that spectra alone suffice for accurate odor prediction is not supported by any reported cross-modal retrieval metrics, embedding-space odor correlation analysis, or ablation removing the contrastive term; without these, outperformance over MS baselines could arise from generic chemical similarity rather than olfactory transfer.
[Abstract] The weakest assumption (that structure embeddings encode perceptual rather than purely structural features and that the alignment objective prioritizes perceptual axes) is load-bearing for the parity-with-structure-models claim, yet no quantitative evidence is provided to rule out that the MS embeddings improve baselines without carrying the claimed perceptual signal.

minor comments (1)

[Abstract] Dataset sizes, number of odor descriptors, specific baselines, and exact performance metrics (e.g., F1, mAP) are omitted, which hinders immediate assessment of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence supporting the perceptual transfer claims. We address each major comment below and commit to revisions that directly test the alignment's role in transferring olfactory semantics.

Referee: [Abstract] The claim that contrastive alignment transfers perceptual semantic information such that spectra alone suffice for accurate odor prediction is not supported by any reported cross-modal retrieval metrics, embedding-space odor correlation analysis, or ablation removing the contrastive term; without these, outperformance over MS baselines could arise from generic chemical similarity rather than olfactory transfer.

Authors: We agree that the current manuscript lacks these direct diagnostics and that outperformance alone does not conclusively isolate perceptual transfer from generic chemical similarity. In the revised version we will add: (i) cross-modal retrieval metrics (spectrum-to-molecule and molecule-to-spectrum recall@K), (ii) embedding-space analysis correlating aligned MS vectors with odor descriptor labels, and (iii) an ablation that trains an identical architecture without the contrastive term and reports the resulting drop in odor-descriptor F1. These additions will quantify whether the contrastive objective specifically aligns perceptual axes. revision: yes
Referee: [Abstract] The weakest assumption (that structure embeddings encode perceptual rather than purely structural features and that the alignment objective prioritizes perceptual axes) is load-bearing for the parity-with-structure-models claim, yet no quantitative evidence is provided to rule out that the MS embeddings improve baselines without carrying the claimed perceptual signal.

Authors: We acknowledge that the manuscript does not yet provide quantitative evidence ruling out purely structural transfer. We will add two analyses in revision: (1) direct correlation between the pretrained structure embeddings and continuous human perceptual ratings (e.g., Pearson r on intensity or pleasantness), and (2) a controlled comparison showing that odor-prediction performance of the aligned MS embeddings exceeds that of non-contrastively trained MS embeddings by a margin comparable to the structure-model gap. These results will test whether the alignment objective preferentially captures perceptual rather than generic structural dimensions. revision: yes
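
The recall@K retrieval diagnostic proposed in the first response could be computed along the following lines. This is a sketch under assumptions, not an analysis from the paper: row i of the query and gallery matrices is assumed to describe the same molecule, cosine similarity is assumed as the ranking score, and the embeddings are synthetic. The function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose matching gallery row appears among the top-k neighbours."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                                   # cosine similarity, (n_query, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k most similar gallery rows
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic example: spectrum embeddings are noisy copies of structure embeddings.
rng = np.random.default_rng(0)
struct = rng.normal(size=(50, 32))
ms = struct + 0.1 * rng.normal(size=(50, 32))

r1 = recall_at_k(ms, struct, k=1)        # spectrum-to-molecule retrieval
r_all = recall_at_k(ms, struct, k=50)    # k equal to the gallery size always gives 1.0
```

Swapping the two arguments gives the molecule-to-spectrum direction. High recall alone would show that the modalities are aligned, not that the alignment is perceptual, which is why the responses pair it with the ablation and correlation analyses.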

Circularity Check

0 steps flagged

No circularity: empirical contrastive alignment claims rest on reported experiments, not definitional reduction

The paper describes a standard multi-modal contrastive framework (SCENT) that aligns EI-MS representations to pretrained structure embeddings and then evaluates odor descriptor prediction empirically. No equations, training objectives, or performance metrics are shown to reduce by construction to the inputs (e.g., no fitted parameter renamed as prediction, no self-definitional loop, no load-bearing self-citation chain). The central claim—that alignment transfers perceptual information—is presented as an experimental outcome rather than a mathematical identity. This is the most common honest finding for an applied ML paper whose results are benchmark-driven.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 1 internal anchor

[1]

D. Feng, C. Li, W. Dai, and P. P. Liang. Smellnet: A large-scale dataset for real-world smell recognition. arXiv preprint arXiv:2506.00239,

[2]

B. Sanchez-Lengeling, J. N. Wei, B. K. Lee, R. C. Gerkin, A. Aspuru-Guzik, and A. B. Wiltschko. Machine learning for scent: Learning generalizable perceptual representations of small molecules. arXiv preprint arXiv:1910.10685,

[3]

A SCENT workflow. This section summarizes the experimental workflow. Detailed model architecture, objectives, and hyperparameters are provided in Section 3 and Appendix C. Data division. We filter the GS-LF dataset to molecules with molecular weight in 50-300 Da, which leaves 2,588 molecules. We first split the 2,588 molecule-spectrum pairs with valid MS spectra into a fixed 1...

2023
[4]

(*p < .05). F Additional results: human rating regression. F.1 Statistical test results of Pearson r. To better understand the statistical significance of the perceptual regression results, we apply the Wilcoxon signed-rank test (Wilcoxon, 1992), a non-parametric paired test, to each label's Pearson's r vector (n = 21) in 100-fold cross-validation. The non-...

1992
