Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning

Takayuki Kimura

arxiv: 2605.16823 · v1 · pith:YKQZDIQ3new · submitted 2026-05-16 · 💻 cs.LG

Pith reviewed 2026-05-19 20:55 UTC · model grok-4.3

keywords: molecular representation learning · vector quantization · graph neural networks · semantic discretization · protein-ligand interaction · molecular tokenization · transformer pretraining · drug discovery

The pith

Vector quantization on atom embeddings yields discrete tokens for chemical contexts that boost protein-ligand prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VQ-Atom to convert continuous atom-level embeddings from graph neural networks into discrete tokens via vector quantization, where each token corresponds to a local chemical environment instead of relying on syntactic strings such as SMILES. These tokens create a molecular language that supports Transformer-based pretraining and downstream modeling. The framework is tested specifically in protein-ligand interaction prediction under a protein-cold split setting that excludes three-dimensional structural information. Results show consistent performance gains over conventional tokenization baselines, indicating that the design of tokens influences how effectively language models capture chemistry. A sympathetic reader cares because improved tokenization could produce more reliable AI systems for molecular tasks that generalize to unseen proteins.

Core claim

VQ-Atom performs semantic discretization by feeding graph neural network embeddings of atoms into a vector quantization layer that assigns each atom to one of a learned set of codebook vectors; each codebook entry represents a chemically meaningful atomic context. The resulting sequence of discrete tokens defines a language representation of the molecule suitable for Transformer pretraining. When this representation is used for protein-ligand interaction prediction without 3D coordinates and under a protein-cold split, it produces higher predictive accuracy than models built on standard syntactic tokenizations.
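The assignment step described above can be illustrated with a minimal sketch. It assumes plain nearest-neighbour lookup under squared Euclidean distance, which is the usual vector-quantization rule; the function names, the toy embeddings, and the two-entry codebook are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each atom embedding to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in embeddings
    ]


# Hypothetical toy data: three 2-D "atom embeddings" and a two-entry codebook.
toy_codebook = [(0.0, 0.0), (1.0, 1.0)]
toy_embeddings = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]
toy_tokens = quantize(toy_embeddings, toy_codebook)
print(toy_tokens)  # [0, 1, 0]
```

The resulting list of integer IDs is what a Transformer would consume as a token sequence; how atoms are ordered into that sequence is not specified in this review.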

What carries the argument

Vector quantization applied to graph neural network atom embeddings, which maps continuous vectors to discrete codebook entries that stand for distinct local chemical environments.

If this is right

Semantically grounded discretization improves predictive performance on protein-ligand interaction tasks compared with syntactic tokenization.
Token design itself determines how well Transformer language models learn useful representations of molecules.
Strong results are possible without three-dimensional structural data in the prediction pipeline.
The learned atomic contexts support generalization to proteins not seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The learned codebook could be inspected to identify recurring chemical motifs that drive the performance lift.
The same quantization step might be applied to other graph-structured domains such as materials or protein sequences alone.
Larger-scale pretraining on diverse molecular graphs could further increase the advantage of these discrete tokens for generative tasks.

Load-bearing premise

The codebook entries obtained from vector quantization on GNN embeddings correspond to chemically meaningful atomic contexts that remain relevant to the downstream task and generalize beyond the training distribution.

What would settle it

Retraining the same downstream model with randomly assigned discrete labels in place of the learned VQ tokens and observing no drop or even an increase in protein-ligand prediction accuracy under the protein-cold split would show that semantic content is not required for the reported gains.
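One way to build such a control is sketched below, assuming the learned tokens are available as one list of integer IDs per molecule. The helper name `random_label_control` is hypothetical. The sketch shuffles tokens across all atoms, which keeps the overall token frequencies and every molecule's length unchanged while breaking the link between a token and its chemical environment.

```python
import random
from typing import List


def random_label_control(token_sequences: List[List[int]],
                         seed: int = 0) -> List[List[int]]:
    """Return sequences of the same lengths with tokens shuffled across all atoms."""
    rng = random.Random(seed)
    # Pool every token from every molecule, then shuffle the pool.
    pool = [t for seq in token_sequences for t in seq]
    rng.shuffle(pool)
    # Cut the shuffled pool back into pieces matching the original lengths.
    shuffled, start = [], 0
    for seq in token_sequences:
        shuffled.append(pool[start:start + len(seq)])
        start += len(seq)
    return shuffled


# Hypothetical token sequences for three small molecules.
learned = [[1, 2, 3], [4, 5], [6]]
control = random_label_control(learned, seed=1)
```

Note that simply renaming token IDs with a fixed permutation would not serve as a control: it preserves which atoms share a token, so a model trained from scratch would see the same information under different names.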

Figures

Figures reproduced from arXiv: 2605.16823 by Takayuki Kimura.

**Figure 2.** Examples of VQ-Atom tokenization. Left: original molecular structures (Gefitinib and Erlotinib). Right: corresponding VQ-Atom representations, where each atom is assigned a discrete token ID based on its local chemical environment. Identical local environments are mapped to the same token ID across molecules. Bolded IDs highlight recurring token patterns shared across chemically similar substructures.

**Figure 3.** Overview of the DTI prediction framework. Protein sequences are encoded using a …

**Figure 4.** Scatter plot of AUROC across random seeds. Each point corresponds to a single seed, …

**Figure 5.** Distribution of positive rates per ligand and per protein under the protein-cold split. While …

Original abstract

Molecular representation learning has become a central approach in AI-driven drug discovery, yet existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this work, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries representing chemically meaningful atomic contexts. These discrete tokens define a molecular language suitable for Transformer-based pretraining. We evaluate VQ-Atom in protein-ligand interaction prediction under a protein-cold split setting without relying on 3D structural information. Experimental results show that VQ-Atom consistently improves predictive performance compared to conventional tokenization approaches, suggesting that semantically grounded discretization can substantially enhance molecular representation learning. Our findings indicate that token design itself plays a critical role in enabling effective language modeling for chemistry.
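The protein-cold setting mentioned in the abstract can be sketched as a group split in which no protein contributes pairs to both training and test data. The sketch below is an assumption-laden illustration: it splits by exact protein ID, the function name and the `(protein_id, ligand_id, label)` tuple layout are hypothetical, and the paper's actual procedure is not described in this review. A stricter variant would group proteins by sequence-similarity cluster instead of by ID.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (protein_id, ligand_id, label), hypothetical layout


def protein_cold_split(pairs: List[Pair], test_fraction: float = 0.2,
                       seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Hold out whole proteins so that test proteins never appear in training."""
    proteins = sorted({protein for protein, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [p for p in pairs if p[0] not in test_proteins]
    test = [p for p in pairs if p[0] in test_proteins]
    return train, test


# Hypothetical interaction data: four proteins, three ligands.
toy_pairs = [("P1", "L1", 1), ("P1", "L2", 0), ("P2", "L1", 0),
             ("P3", "L3", 1), ("P4", "L2", 1)]
train_pairs, test_pairs = protein_cold_split(toy_pairs, test_fraction=0.25)
```

Ligands may still appear on both sides of the split; only proteins are held out, which is what makes the evaluation a test of generalization to unseen targets.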

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

VQ-Atom discretizes GNN atom embeddings with vector quantization to create tokens for molecular Transformers, but the gains on protein-ligand prediction lack checks that the tokens reflect real chemical contexts rather than just regularization.

The main thing to know is that this paper takes continuous atom embeddings from a GNN, runs them through vector quantization to produce a codebook of discrete tokens, and then feeds those tokens into a Transformer for pretraining. They test it on protein-ligand interaction prediction under a protein-cold split with no 3D input and report consistent gains over standard SMILES-style tokenization.

That setup targets a real pain point in molecular language models where syntactic tokens do not line up with substructures that matter for binding or reactivity. The approach is straightforward and builds on existing VQ ideas without overcomplicating the pipeline. It does a clean job showing that token choice can affect downstream performance in a generalization setting that avoids easy data leakage. The protein-cold split is a sensible choice here.

The soft spots are more noticeable. The central claim rests on the tokens being semantically grounded in local chemical environments, yet there is no mapping of codebook entries back to known atom types, hybridization, or functional groups, and no ablation that holds the rest of the model fixed while varying only the discretization step. Without those, the performance lift could come from the quantization acting as a regularizer or from differences in embedding dimension and training schedule rather than any chemical meaning. The abstract states the improvements but gives no numbers, confidence intervals, or baseline details, which makes it difficult to judge how large or robust the effect actually is.

This work is aimed at groups already building graph-to-sequence models for drug discovery or property prediction. A reader looking for fresh tokenization tricks could pick up the basic recipe and try it, but they would still need to add their own validation experiments.

I would send it to peer review once the authors add the missing post-hoc analysis and ablations; the core idea is worth referee time even if the current evidence is preliminary.

Referee Report

2 major / 2 minor

Summary. The paper introduces VQ-Atom, a semantic discretization framework that uses graph neural network embeddings followed by vector quantization to convert continuous atom-level representations into discrete tokens corresponding to local chemical environments. These tokens are positioned as a molecular language for subsequent Transformer-based pretraining. The approach is evaluated on protein-ligand interaction prediction under a protein-cold split without 3D structural information, with the claim that it yields consistent improvements over conventional tokenization methods.

Significance. If the empirical gains are robust and the discretization is shown to capture chemically relevant contexts that generalize, the work could meaningfully advance molecular representation learning by moving beyond syntactic tokenizations such as SMILES toward semantically grounded discrete units. The protein-cold split provides a relevant test of generalization for drug-discovery applications. The emphasis on token design as a key factor is a useful perspective, though its impact depends on stronger validation of the semantic claims.

major comments (2)

[Results section] The abstract and introduction assert consistent predictive improvements, yet no specific quantitative metrics (e.g., AUC or precision values), baseline implementations, number of runs, or statistical significance tests are supplied. This absence prevents verification of the magnitude and reliability of the reported gains, which are central to the paper's contribution.
[Method and interpretation sections] The claim that codebook entries represent 'chemically meaningful atomic contexts' is not supported by any post-hoc analysis (e.g., mapping indices to atom types, hybridization states, or functional groups) or ablation isolating the semantic content of the tokens from other pipeline choices such as GNN depth or the Transformer objective. Without these checks, performance differences could arise from regularization effects of discretization rather than chemical relevance, weakening the interpretation of the headline result.
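The post-hoc check requested in the second major comment could start from a simple tally like the one below. It is a sketch under stated assumptions: each atom already has a token ID and some per-atom annotation (element, hybridization state, functional-group membership, or similar) obtained elsewhere, for example from a cheminformatics toolkit. The function name `token_purity` and the annotation strings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def token_purity(tokens: List[int],
                 annotations: List[str]) -> Dict[int, Tuple[str, float]]:
    """For each token ID, report its most common annotation and that annotation's share."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for token, annotation in zip(tokens, annotations):
        counts[token][annotation] += 1
    report = {}
    for token, counter in counts.items():
        top_annotation, n = counter.most_common(1)[0]
        report[token] = (top_annotation, n / sum(counter.values()))
    return report


# Hypothetical per-atom data: token IDs and hand-written annotations.
atom_tokens = [7, 7, 7, 12, 12, 3]
atom_annotations = ["aromatic C", "aromatic C", "aromatic N",
                    "hydroxyl O", "hydroxyl O", "sp3 C"]
purity = token_purity(atom_tokens, atom_annotations)
```

A share close to 1.0 for most tokens would be consistent with tokens tracking the chosen annotation, while shares near chance would suggest the codebook partitions atoms along some other axis. Neither outcome by itself separates semantic content from regularization effects; the ablation the referee asks for is still needed.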

minor comments (2)

[Notation] Notation for embeddings, codebook, and quantization loss could be summarized in a table for clarity.
[Related work] A few additional references to prior discrete representation work in graph-based molecular models would strengthen the related-work discussion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation and support for our claims.

Referee: [Results section] The abstract and introduction assert consistent predictive improvements, yet no specific quantitative metrics (e.g., AUC or precision values), baseline implementations, number of runs, or statistical significance tests are supplied. This absence prevents verification of the magnitude and reliability of the reported gains, which are central to the paper's contribution.

Authors: We acknowledge that the abstract and introduction currently describe improvements only in qualitative terms. The results section reports comparative performance on the protein-ligand task, but to address the referee's concern directly we will revise the abstract and introduction to include explicit quantitative metrics (e.g., AUC values), details on baseline implementations, the number of independent runs performed, and statistical significance testing. These changes will make the magnitude and reliability of the gains immediately verifiable.

Revision: yes

Referee: [Method and interpretation sections] The claim that codebook entries represent 'chemically meaningful atomic contexts' is not supported by any post-hoc analysis (e.g., mapping indices to atom types, hybridization states, or functional groups) or ablation isolating the semantic content of the tokens from other pipeline choices such as GNN depth or the Transformer objective. Without these checks, performance differences could arise from regularization effects of discretization rather than chemical relevance, weakening the interpretation of the headline result.

Authors: We agree that additional analysis is needed to substantiate the semantic interpretation. In the revised manuscript we will add a post-hoc examination mapping codebook entries to atom types, hybridization states, and functional groups, together with ablation studies that vary GNN depth and isolate the contribution of vector quantization from the Transformer pretraining objective. These additions will help distinguish semantic relevance from possible regularization effects.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains reported from experiments

The paper proposes VQ-Atom as a discretization method that applies vector quantization to GNN-derived atom embeddings to produce discrete tokens representing local chemical contexts, then uses these tokens for Transformer pretraining and evaluates the approach on protein-ligand binding prediction under a protein-cold split. The headline result is framed as an observed performance improvement relative to baseline tokenization schemes, not as a quantity algebraically derived from or equivalent to fitted parameters, self-citations, or prior ansatzes within the same work. No equations or steps in the described pipeline reduce by construction to the inputs (e.g., no fitted codebook entries are relabeled as predictions of chemical meaning, and no uniqueness theorem is imported from overlapping prior authorship). The validation remains externally falsifiable via held-out experimental metrics, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract; the primary unstated premise is that GNN-derived embeddings contain sufficient local chemical information for vector quantization to produce useful discrete tokens. No free parameters or invented entities are explicitly described.

axioms (1)

Domain assumption: Graph neural network embeddings of atoms capture local chemical environments in a form suitable for meaningful discretization.
The framework begins by producing continuous atom embeddings via GNNs before applying vector quantization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Vector quantization is applied using per-atom-type codebooks of size 10,000, initialized via k-means and updated with exponential moving averages... $L = L_{\text{commit}} + \lambda_1 L_{\text{lat-repel}} + \lambda_2 L_{\text{cb-repel}}$"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure"
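The first passage quoted above mentions codebooks updated with exponential moving averages. A minimal sketch of the standard EMA update from the VQ-VAE literature is given below for orientation. It is not the paper's implementation: the decay value, the `eps` stabilizer, and all names are assumptions, and the per-atom-type codebooks, k-means initialization, and repel loss terms from the passage are not modeled.

```python
from typing import List


def ema_codebook_update(codebook: List[List[float]],
                        cluster_size: List[float],
                        embed_sum: List[List[float]],
                        embeddings: List[List[float]],
                        assignments: List[int],
                        decay: float = 0.99,
                        eps: float = 1e-5) -> None:
    """Update codebook vectors in place from one batch of assigned embeddings."""
    num_codes, dim = len(codebook), len(codebook[0])
    # Per-code counts and vector sums for the current batch.
    batch_count = [0.0] * num_codes
    batch_sum = [[0.0] * dim for _ in range(num_codes)]
    for z, k in zip(embeddings, assignments):
        batch_count[k] += 1.0
        for d in range(dim):
            batch_sum[k][d] += z[d]
    for k in range(num_codes):
        # Running averages of how many embeddings hit code k and of their sum.
        cluster_size[k] = decay * cluster_size[k] + (1.0 - decay) * batch_count[k]
        for d in range(dim):
            embed_sum[k][d] = decay * embed_sum[k][d] + (1.0 - decay) * batch_sum[k][d]
            # The code vector is the running sum divided by the running count.
            codebook[k][d] = embed_sum[k][d] / (cluster_size[k] + eps)


# Hypothetical one-code, one-dimensional example.
codes = [[0.0]]
sizes = [1.0]
sums = [[0.0]]
ema_codebook_update(codes, sizes, sums, embeddings=[[1.0]], assignments=[0],
                    decay=0.5, eps=0.0)
```

With this scheme the codebook follows the running mean of the embeddings assigned to each entry, so no gradient flows into the codebook itself; only the encoder is trained through the commitment term.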

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[2] Tom B Brown et al. Language models are few-shot learners. NeurIPS, 2020.

[3] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

[4] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.

[5] Robert Geirhos et al. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020.

[6] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, 2017.

[7] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.

[8] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.

[9] Anna Gaulton, Louisa J Bellis, A. Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, and John P Overington. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(D1):D1100–D1107, 2012.

[10] Renxiao Wang, Xueliang Fang, Yipin Lu, Chaoyuan Yang, and Shaomeng Wang. The PDBbind database: collection of binding affinities for protein–ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004.

[11] Tong He, Michael Heidemeyer, Fajie Ban, Artem Cherkasov, and Martin Ester. SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines. Journal of Cheminformatics, 9(1):24, 2017.

[12] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.

[13] Thin Nguyen, Hang Le, Thomas P. Quinn, Tri Nguyen, Thuc Duy Le, and Svetha Venkatesh. GraphDTA: predicting drug-target binding affinity with graph neural networks. Bioinformatics, 37(8):1140–1147, 2021.

[14] Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, 2017.

[15] Tiqing Liu, Yu Lin, Xinyan Wen, Robert N Jorissen, and Michael K Gilson. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research, 35(suppl_1):D198–D201, 2007.

Appendix A, Additional Implementation Details: Each atom is represented by a concatenation of features including atomic number, degree, formal...
