SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Esben Jannik Bjerrum · 2017 · cs.LG · arXiv 1703.07076

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Simplified Molecular Input Line Entry System (SMILES) is a single-line text representation of a unique molecule. One molecule can, however, have multiple SMILES strings, which is why canonical SMILES have been defined to ensure a one-to-one correspondence between SMILES string and molecule. Here, the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short-term memory (LSTM) cell-based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set than a model built with only one canonical SMILES string per molecule. The correlation coefficient R2 on the test set improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase: averaging, per molecule, the predictions for the enumerated SMILES gave a further improvement to a correlation coefficient of 0.68 and an RMS of 0.52.
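The prediction-phase step described in the abstract, averaging the predictions over all enumerated SMILES of the same molecule, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `average_per_molecule`, the molecule labels, and the numeric predictions are all hypothetical and do not come from the paper, and no trained model is involved.

```python
from collections import defaultdict
from statistics import mean


def average_per_molecule(mol_ids, predictions):
    """Average predictions over all enumerated SMILES of each molecule.

    mol_ids[i] identifies the molecule whose SMILES variant produced
    predictions[i]. Returns a dict mapping molecule id -> mean prediction.
    """
    if len(mol_ids) != len(predictions):
        raise ValueError("mol_ids and predictions must have the same length")
    grouped = defaultdict(list)
    for mol_id, pred in zip(mol_ids, predictions):
        grouped[mol_id].append(pred)
    return {mol_id: mean(preds) for mol_id, preds in grouped.items()}


# Hypothetical input: ethanol written as three equivalent SMILES strings
# and propane as two. The prediction values are made up for illustration.
enumerated = [
    ("ethanol", "CCO"),
    ("ethanol", "OCC"),
    ("ethanol", "C(O)C"),
    ("propane", "CCC"),
    ("propane", "C(C)C"),
]
raw_predictions = [1.0, 2.0, 3.0, 0.5, 1.5]

averaged = average_per_molecule([mol for mol, _ in enumerated], raw_predictions)
print(averaged)
```

Each molecule ends up with a single value regardless of how many SMILES variants were scored, which is what allows the averaged predictions to be compared against one measured value per molecule.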

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

When and How to Canonize: A Generalization Perspective

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Canonization produces generalization bounds ranging from invariant-optimal to non-invariant depending on regularity, with Hilbert-curve ordering proven to give polynomial covering-number growth for point clouds while lexicographic sorting gives exponential growth.

Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.

Toxicity Prediction by Multimodal Deep Learning

cs.LG · 2019-07-19 · unverdicted · novelty 5.0

A multimodal deep learning approach using heterogeneous representations and network types achieves significantly higher accuracy than state-of-the-art methods on a standard toxicity benchmark.

SMolLM: Small Language Models Learn Small Molecular Grammar

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

A 53K-parameter weight-shared transformer generates novel, valid SMILES at a 95% rate on ZINC-250K and resolves constraints hierarchically through bracket, ring, and valence stages, as shown by probing and ablation.
