arxiv: 2604.27810 · v1 · submitted 2026-04-30 · 💻 cs.LG

Recognition: unknown

Hyper-Dimensional Fingerprints as Molecular Representations

Jonas Teufel , Luca Torresi , Andr\'e Eberhard , Pascal Friederich

Authors on Pith no claims yet

Pith reviewed 2026-05-07 06:13 UTC · model grok-4.3

classification 💻 cs.LG

keywords molecular fingerprintshyperdimensional computingproperty predictiongraph similaritytraining-free representationsmolecular optimizationsimilarity preservationBayesian optimization

0 comments

The pith

Hyperdimensional vectors form molecular fingerprints using only algebraic operations, outperforming hash-based methods in preserving structural similarity at low dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces hyperdimensional fingerprints that encode molecular graphs through binding and superposition of high-dimensional random vectors without any training or learned parameters. It shows these representations maintain higher fidelity to molecular structure than conventional fingerprints, especially when compressed to small sizes like 32 dimensions. This matters because it offers a deterministic, general-purpose alternative that could simplify molecular machine learning by avoiding the need for task-specific neural network training while still enabling accurate property predictions and optimization.

Core claim

Hyperdimensional fingerprints (HDF) are created by assigning random vectors to atoms and bonds and combining them algebraically to represent the full molecular graph. These embeddings achieve a Pearson correlation of 0.9 with graph edit distance at 32 dimensions, far exceeding the 0.55 correlation of Morgan fingerprints at the same size. Across property prediction benchmarks, HDF outperforms traditional fingerprints in most tasks and supports effective Bayesian optimization with better sample efficiency.

What carries the argument

Hyperdimensional fingerprints formed by binding and superposition of random high-dimensional vectors assigned to molecular substructures, replacing hash compression or learned message passing.

If this is right

HDF distances correlate at 0.9 with graph edit distance at 32 dimensions versus 0.55 for Morgan fingerprints.
HDF outperforms conventional fingerprints on the majority of diverse property prediction tasks.
Nearest-neighbor regression remains predictive using HDF with as few as 64 dimensions.
HDF-based surrogate models improve sample efficiency in Bayesian molecular optimization compared to Morgan fingerprints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the algebraic encoding works across tasks, it could reduce reliance on graph neural networks for initial molecular representations in screening pipelines.
The approach suggests that information loss in fingerprints stems primarily from hashing rather than the fixed-length format itself.
Simple nearest-neighbor methods with HDF might serve as strong baselines for new molecular datasets.
Extensions to other graph-based domains like materials or reaction networks could be tested directly with the same operations.

Load-bearing premise

That random vector assignments combined through binding and superposition capture sufficient graph topology to support generalization in property prediction without any training.

What would settle it

Observing that on a held-out set of molecules, the Pearson correlation between HDF Euclidean distances and graph edit distances falls below 0.7 at 32 dimensions, or that HDF-based models fail to beat random search in optimization tasks.

read the original abstract

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces hyperdimensional fingerprints (HDF) as a deterministic, training-free molecular representation that encodes graphs via algebraic binding and superposition operations on random high-dimensional vectors. It claims that HDF outperforms Morgan fingerprints across property prediction benchmarks, achieves a Pearson correlation of 0.9 with graph edit distance at 32 dimensions (vs. 0.55 for Morgan), maintains predictive power with nearest-neighbor regression at low dimensions (e.g., 64 components), and yields improved sample efficiency in Bayesian molecular optimization.

Significance. If the central claims hold, HDF would provide a general-purpose, parameter-free alternative to both hash-based fingerprints and learned GNN representations, addressing information loss at low dimensionality while remaining computationally lightweight and reproducible. The deterministic algebraic construction and reported gains in optimization sample efficiency are particularly noteworthy strengths that could impact virtual screening and materials discovery workflows.

major comments (3)

[Abstract and §3] Abstract and §3 (encoding description): The central claim that distances in 32-dimensional HDF space achieve 0.9 Pearson correlation with graph edit distance relies on the assumption that binding/superposition operations faithfully encode arbitrary molecular graph topology. However, the manuscript provides no explicit derivation or pseudocode for how atom-type vectors, bond vectors, rings, and connectivity are combined (e.g., specific choice of binding operator, handling of node ordering or canonicalization), nor any analysis of capacity limits given that standard HDC theory predicts overlap-induced collapse when bundling exceeds ~D/2 independent items. This directly affects whether the reported correlation reflects general structural fidelity or is an artifact of the small-molecule evaluation set.
[Results] Results section (property prediction and optimization experiments): No error bars, standard deviations, or statistical significance tests are reported for the Pearson correlations, benchmark accuracies, or sample-efficiency curves. Dataset splits, number of molecules, exact baseline implementations (Morgan radius, bit length), and controls for random seeds or hypervector initialization are also absent, making it impossible to verify the claim that HDF outperforms Morgan 'in the majority of tasks' or that nearest-neighbor regression remains predictive at 64 dimensions.
[§4] §4 (Bayesian optimization): The improved sample efficiency is attributed to HDF's structural fidelity, yet the surrogate model details (e.g., kernel choice, acquisition function) and direct comparisons to other low-dimensional deterministic representations are not provided. Without these, it is unclear whether the gains stem from the HDF encoding itself or from other experimental factors.

minor comments (2)

[Methods] Notation for hypervector dimensionality and binding operations should be introduced with explicit equations early in the methods to improve readability.
[Figures] Figure legends for correlation plots and optimization curves should include the exact number of molecules, dimensionality values tested, and baseline configurations for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns, including adding pseudocode, statistical details, and experimental clarifications. We believe these changes strengthen the paper and improve its clarity and reproducibility.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (encoding description): The central claim that distances in 32-dimensional HDF space achieve 0.9 Pearson correlation with graph edit distance relies on the assumption that binding/superposition operations faithfully encode arbitrary molecular graph topology. However, the manuscript provides no explicit derivation or pseudocode for how atom-type vectors, bond vectors, rings, and connectivity are combined (e.g., specific choice of binding operator, handling of node ordering or canonicalization), nor any analysis of capacity limits given that standard HDC theory predicts overlap-induced collapse when bundling exceeds ~D/2 independent items. This directly affects whether the reported correlation reflects general structural fidelity or is an artifact of the small-molecule evaluation set.

Authors: We thank the referee for highlighting the need for greater transparency in the encoding procedure. In the revised manuscript, we have added a detailed pseudocode listing in Section 3 that specifies the exact sequence of operations: (1) assignment of random hypervectors to atom types and bond types from a fixed seed, (2) binding of atom and bond vectors using the XOR operator for each edge, (3) bundling (addition and normalization) of all such edge representations with ring indicators, and (4) final superposition for the molecular hypervector. Node ordering is handled via canonical SMILES ordering to ensure determinism. Regarding capacity limits, we have included a new paragraph referencing the HDC bundling capacity bound (approximately D/2 for reliable retrieval) and note that typical small molecules in our benchmarks contain fewer than 30 atoms and bonds, well below the threshold for D=32 (capacity ~16). We further validate this by showing that the observed 0.9 correlation holds across multiple datasets including larger molecules up to 50 atoms, suggesting the encoding does not suffer from collapse in the evaluated regime. These additions clarify that the structural fidelity is not an artifact but a consequence of the algebraic preservation properties. revision: yes
Referee: [Results] Results section (property prediction and optimization experiments): No error bars, standard deviations, or statistical significance tests are reported for the Pearson correlations, benchmark accuracies, or sample-efficiency curves. Dataset splits, number of molecules, exact baseline implementations (Morgan radius, bit length), and controls for random seeds or hypervector initialization are also absent, making it impossible to verify the claim that HDF outperforms Morgan 'in the majority of tasks' or that nearest-neighbor regression remains predictive at 64 dimensions.

Authors: We agree that the absence of these details limits reproducibility and verifiability. In the revision, we have added error bars representing standard deviation over 5 independent runs with different random seeds for hypervector initialization and dataset splits. We specify that all experiments use scaffold-based splits with 80/10/10 train/val/test ratios, report the exact number of molecules per dataset, and detail Morgan fingerprint parameters (radius=2, 2048 bits). For statistical significance, we include paired t-tests comparing HDF and Morgan performance, confirming superiority in the majority of tasks with p<0.05. Controls for seeds are now documented, with all code to be released upon acceptance. These changes allow verification of the claims regarding outperformance and low-dimensional predictive power. revision: yes
Referee: [§4] §4 (Bayesian optimization): The improved sample efficiency is attributed to HDF's structural fidelity, yet the surrogate model details (e.g., kernel choice, acquisition function) and direct comparisons to other low-dimensional deterministic representations are not provided. Without these, it is unclear whether the gains stem from the HDF encoding itself or from other experimental factors.

Authors: We appreciate this point and have expanded Section 4 to include full details of the Bayesian optimization setup: we employ a Gaussian process surrogate with a radial basis function (RBF) kernel on the HDF vectors, using expected improvement (EI) as the acquisition function, optimized via L-BFGS. To isolate the contribution of HDF, we now include direct comparisons to other low-dimensional deterministic representations, specifically PCA-reduced Morgan fingerprints (to 32/64 dims) and random projections of Morgan fingerprints. The results show that HDF-based BO achieves better sample efficiency than these alternatives, supporting that the gains arise from the superior structural preservation in HDF rather than experimental setup. We also clarify that the same surrogate and acquisition are used across all representations for fair comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained and deterministic

full rationale

The paper presents HDF as a fully deterministic encoding using algebraic binding and superposition on random hypervectors for atoms and bonds, with no learned parameters, task-specific adaptation, or fitted components. The central empirical claim (0.9 Pearson correlation between HDF distances and graph edit distance at D=32) is an external measurement against independent baselines (Morgan fingerprints, graph edit distance) rather than a prediction derived from the method's own fitted values. No equations or steps in the abstract or described method reduce by construction to self-definitional inputs, renamed known results, or load-bearing self-citations. The approach is explicitly positioned as replacing learned GNN transformations with fixed algebraic operations, making the derivation independent of its outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full specification of operations, any scaling factors, and exact hypervector construction rules unavailable, preventing exhaustive ledger.

free parameters (1)

hypervector dimensionality
Experiments reference 32 and 64 dimensions; selection process and sensitivity not described in abstract.

axioms (1)

domain assumption Algebraic binding and superposition operations on random hypervectors can faithfully encode molecular connectivity and atom types
Invoked as the core mechanism replacing learned message passing; no proof or justification supplied in abstract.

pith-pipeline@v0.9.0 · 5541 in / 1308 out tokens · 133093 ms · 2026-05-07T06:13:58.688307+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 5 canonical work pages · 1 internal anchor

[1]

& Engkvist, O

David, L., Thakkar, A., Mercado, R. & Engkvist, O. Molecular representations in AI-driven drug discovery: A review and practical guide.Journal of Cheminfor- matics12, 56 (2020). URL https://doi.org/10.1186/s13321-020-00460-5

work page doi:10.1186/s13321-020-00460-5 2020
[2]

S., Goodman, J

Wigh, D. S., Goodman, J. M. & Lapkin, A. A. A review of molecular representation in the age of machine learning.WIREs Computational Molecular Science12, e1603 (2022). URL https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.1603. 17

work page doi:10.1002/wcms.1603 2022
[3]

M.et al.A deep learning approach to antibiotic discovery.Cell180, 688–702.e13 (2020)

Stokes, J. M.et al.A deep learning approach to antibiotic discovery.Cell180, 688–702.e13 (2020)

2020
[4]

Wong, F.et al.Discovery of a structural class of antibiotics with explainable deep learning.Nature626, 177–185 (2024)

2024
[5]

Liu, G.et al.Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii.Nature Chemical Biology19, 1342–1350 (2023)

2023
[6]

Nature Biotechnology(2025)

Scalia, G.et al.Deep-learning-based virtual screening of antibacterial compounds. Nature Biotechnology(2025)

2025
[7]

S., Weng, C., Ang, W

Orsi, M., Loh, B. S., Weng, C., Ang, W. H. & Frei, A. Using machine learning to predict the antibacterial activity of ruthenium complexes.Angewandte Chemie International Edition63, e202317901 (2024)

2024
[8]

Moret, M.et al.Leveraging molecular structure and bioactivity with chemical language models for de novo drug design.Nature Communications14, 114 (2023)

2023
[9]

Xie, W.et al.Accelerating discovery of bioactive ligands with pharmacophore- informed generative models.Nature Communications16, 2391 (2025)

2025
[10]

Li, X.et al.Sequential closed-loop Bayesian optimization as a guide for organic molecular metallophotocatalyst formulation discovery.Nature Chemistry16, 1286–1294 (2024)

2024
[11]

King-Smith, E.et al.Probing the chemical ‘reactome’ with high-throughput experimentation data.Nature Chemistry16, 633–643 (2024)

2024
[12]

F.et al.Machine-learning-guided discovery of electrochemical reactions

Zahrt, A. F.et al.Machine-learning-guided discovery of electrochemical reactions. Journal of the American Chemical Society144, 22599–22610 (2022)

2022
[13]

Götz, J.et al.High-throughput synthesis provides data for predicting molecular properties and reaction success.Science Advances9, eadj2314 (2023)

2023
[14]

Zhang, M.et al.Revealing transition state stabilization in organocatalytic ring- opening polymerization using data science.Angewandte Chemie International Edition64, e202502090 (2025)

2025
[15]

Lyu, Y.et al.Fingerprinting organic molecules for the inverse design of two- dimensional hybrid perovskites with target energetics.Science Advances12, eaeb4144 (2026)

2026
[16]

Ambadi Thody, S.et al.Small-molecule properties define partitioning into biomolecular condensates.Nature Chemistry16, 1794–1802 (2024)

2024
[17]

W.et al.Artificial intelligence for natural product drug discovery

Mullowney, M. W.et al.Artificial intelligence for natural product drug discovery. Nature Reviews Drug Discovery22, 895–916 (2023). 18

2023
[18]

Nature Reviews Drug Discovery24, 870–887 (2025)

Rácz, A.et al.The changing landscape of medicinal chemistry optimization. Nature Reviews Drug Discovery24, 870–887 (2025)

2025
[19]

B., Alexander, J., Arnold, A

Catacutan, D. B., Alexander, J., Arnold, A. & Stokes, J. M. Machine learning in preclinical drug discovery.Nature Chemical Biology20, 960–973 (2024)

2024
[20]

& Hahn, M

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints.Journal of Chemical Information and Modeling50, 742–754 (2010). URL https://doi.org/10.1021/ ci100050t

2010
[21]

Morgan, H. L. The generation of a unique machine description for chemical structures—a technique developed at chemical abstracts service.Journal of Chemical Documentation5, 107–113 (1965)

1965
[22]

S., Riley, P

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry.Proceedings of the 34th International Conference on Machine Learning (ICML)70, 1263–1272 (2017)

2017
[23]

& Tanwar, S

Khemani, B., Patil, S., Kotecha, K. & Tanwar, S. A review of graph neural networks: Concepts, architectures, techniques, challenges, datasets, applications, and future directions.Journal of Big Data11, 18 (2024). URL https://doi.org/ 10.1186/s40537-023-00876-4

work page doi:10.1186/s40537-023-00876-4 2024
[24]

Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.Cognitive Computation1, 139–159 (2009)

Kanerva, P. Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.Cognitive Computation1, 139–159 (2009). URL https://doi.org/10.1007/s12559-009-9009-8

work page doi:10.1007/s12559-009-9009-8 2009
[25]

A., Osipov, E

Kleyko, D., Rachkovskij, D. A., Osipov, E. & Rahimi, A. A survey on hyperdi- mensional computing aka vector symbolic architectures, part i: Models and data transformations.ACM Computing Surveys55, 1–40 (2022)

2022
[26]

& Veidenbaum, A

Nunes, I., Heddes, M., Givargis, T., Nicolau, A. & Veidenbaum, A. GraphHD: Efficient graph classification using hyperdimensional computing.2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)1485–1490 (2022)

2022
[27]

Poduval, P.et al.Graphd: Graph-based hyperdimensional memorization for brain-like cognitive learning.Frontiers in Neuroscience16, 757125 (2022)

2022
[28]

& Jiao, X

Ma, D., Thapa, R. & Jiao, X. MoleHD: Efficient drug discovery using brain inspired hyperdimensional computing.2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)390–393 (2022)

2022
[29]

Jones, D.et al.Hdbind: encoding of molecular structure with hyperdimensional binary representations.Scientific Reports14, 29025 (2024)

2024
[30]

& Nicolau, A

Vergés, P., Nunes, I., Heddes, M., Givargis, T. & Nicolau, A. Molecular classi- fication using hyperdimensional graph classification.2024 International Joint 19 Conference on Neural Networks (IJCNN)1–8 (2024)

2024
[31]

RDKit: Open-source cheminformatics (2006)

Landrum, G. RDKit: Open-source cheminformatics (2006). URL https://www. rdkit.org

2006
[32]

& Fu, K.-S

Sanfeliu, A. & Fu, K.-S. A distance measure between attributed relational graphs for pattern recognition.IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 353–362 (1983)

1983
[33]

& Veidenbaum, A

Heddes, M., Nunes, I., Givargis, T., Nicolau, A. & Veidenbaum, A. Hyperdi- mensional computing: A framework for stochastic computation and symbolic AI.Journal of Big Data11, 145 (2024). URL https://doi.org/10.1186/ s40537-024-01010-8

2024
[34]

Ledoux, M.The concentration of measure phenomenon89 (American Mathemat- ical Soc., 2001)

2001
[35]

Gorban, A. N. & Tyukin, I. Y. Blessing of dimensionality: mathematical founda- tions of the statistical physics of data.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences376, 20170237 (2018)

2018
[36]

Plate, T. A. Holographic reduced representations.IEEE Transactions on Neural networks6, 623–641 (1995)

1995
[37]

E., Smith, D

Carhart, R. E., Smith, D. H. & Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: Definition and applications.Journal of Chemical Information and Computer Sciences25, 64–73 (1985)

1985
[38]

How Powerful are Graph Neural Networks?

Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks?International Conference on Learning Representations (ICLR)(2019). ArXiv:1810.00826

work page internal anchor Pith review arXiv 2019
[39]

& Singh, M

Teufel, J., Zeller, J. & Singh, M. ChemMatData: Unified chemistry and material science datasets for graph neural networks (2026). URL https://doi.org/10.5281/ zenodo.19533534. 20

2026