
arxiv: 1907.08333 · v1 · pith:32R5YKPLnew · submitted 2019-07-19 · 💻 cs.LG · physics.chem-ph · stat.ML

Toxicity Prediction by Multimodal Deep Learning

Pith reviewed 2026-05-24 19:23 UTC · model grok-4.3

classification 💻 cs.LG · physics.chem-ph · stat.ML
keywords toxicity prediction · multimodal deep learning · QSAR modeling · neural network ensembles · chemical compound representation · convolutional recurrent networks

The pith

Multimodal deep learning combining strings, images and numerical features of chemicals with multiple network types yields higher toxicity prediction accuracy than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method that represents chemical compounds using strings, images, and numerical features, then trains and ensembles fully connected, convolutional, and recurrent neural networks on these inputs. The goal is to leverage the different strengths of each representation and network type so their combined performance exceeds what any single one achieves. On a standard toxicity benchmark the authors state this produces significantly better accuracy than existing state-of-the-art toxicity prediction methods. A reader would care because accurate toxicity forecasts matter for Quantitative Structure-Activity Relationship modeling used in chemical safety and drug design, where current deep-learning results remain limited.

Core claim

Representing chemical compounds by strings, images, and numerical features and training ensembles of fully connected, convolutional, and recurrent neural networks produces significantly better accuracy on a standard toxicity benchmark than state-of-the-art toxicity prediction methods.

What carries the argument

Multimodal ensemble of fully connected, convolutional and recurrent networks applied to string, image and numerical representations of compounds.
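
A minimal sketch of the combination step, assuming (the summary does not say this) that each network emits one numeric prediction per compound and that the ensemble is an unweighted mean. The function name `average_ensemble`, the model labels, and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def average_ensemble(predictions_by_model):
    """Average per-compound predictions across models.

    predictions_by_model maps a model name to a list of predictions,
    one per compound, all in the same compound order.
    """
    columns = list(predictions_by_model.values())
    if not columns:
        raise ValueError("need at least one model")
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all models must score the same compounds")
    # zip(*columns) yields one tuple of per-model predictions per compound.
    return [fmean(per_compound) for per_compound in zip(*columns)]


# Toy, made-up predictions for three compounds (not from the paper).
toy_predictions = {
    "rnn_on_strings": [0.2, 0.9, 0.4],
    "cnn_on_images": [0.4, 0.7, 0.6],
    "fcnn_on_features": [0.3, 0.8, 0.5],
}
ensemble = average_ensemble(toy_predictions)
```

An unweighted mean is only the simplest possible rule; the paper may combine its networks differently.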

If this is right

  • Toxicity forecasts for new chemical compounds become more reliable.
  • Quantitative Structure-Activity Relationship models can incorporate diverse data views without relying on one representation.
  • Ensemble training across network families can offset weaknesses of any individual architecture.
  • Prediction pipelines gain robustness by letting complementary signals reinforce each other.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multimodal pattern could be tested on other molecular property tasks such as solubility or binding affinity.
  • If the gains persist under stricter controls, it would support building unified chemical-data platforms that automatically route inputs across representation types.
  • Future work could measure whether the improvement scales with the number of distinct representations or network families used.

Load-bearing premise

The accuracy gains come from the multimodal combination of representations and network types rather than from dataset-specific tuning or implementation details.

What would settle it

An independent run on the same benchmark that shows no statistically significant accuracy improvement over the prior state-of-the-art, or an ablation that matches the reported performance with a single representation or single network type.
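
One way such a significance check could be run is a paired sign-flip permutation test on per-compound errors. This is a sketch under stated assumptions: the paper's own statistical procedure, if any, is not described here, and the function name `paired_sign_flip_pvalue` and the toy error lists are hypothetical.

```python
import random
from statistics import fmean


def paired_sign_flip_pvalue(errors_a, errors_b, n_permutations=2000, seed=0):
    """Two-sided p-value for a difference in mean per-compound error."""
    if len(errors_a) != len(errors_b):
        raise ValueError("error lists must be paired")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(fmean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy, made-up absolute errors on the same 20 compounds (not from the paper).
baseline_errors = [1.0] * 20
candidate_errors = [0.0] * 20
p_clear_gain = paired_sign_flip_pvalue(baseline_errors, candidate_errors)
p_no_gain = paired_sign_flip_pvalue(baseline_errors, baseline_errors)
```

A small p-value for the real per-compound errors would support the improvement claim; a large one would count against it.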

Figures

Figures reproduced from arXiv: 1907.08333 by Abdollah Dehzangi, Abdul Karim, Abdul Sattar, Avinash Mishra, Jaspreet Singh, M. A. Hakim Newton.

Figure 1. The proposed multimodal deep learning architecture. SMILES strings of chemical compounds are first transformed into a vector format, a molecular image format, or a set of numerical features. Then an RNN, a CNN, and an FCNN are trained on the vector format, the image format, and the numerical features, respectively. The coupling between the data representations and the neural network types is becaus…
Figure 2. Computing a molecular image from 2D coordinates generated from a SMILES string by using the open-source Python library RDKit.
Adjacent text, Section 3.3 Numerical Features: The 2D numerical features used are less multifarious in nature and easy to calculate. 1422 2D features are computed using the open-source software PaDEL-Descriptor [23]. The main reason for using 2D features is that these descriptors have shown promising prediction p…
Figure 3. ResNet architecture used in the CNN.
  • Input: The input image has shape (100 × 100) with 4 channels.
  • Stage 1: The 2D convolution has 64 filters of shape (7, 7) and uses a stride of (2, 2). BatchNorm is applied to the channels axis of the input. MaxPooling uses a (3, 3) window and a (2, 2) stride.
  • Stage 2: The convolutional block uses three sets of filters of size [64, 256, 256], each with a shape (1, 1) and…
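
The Stage 1 numbers in the Figure 3 caption can be turned into spatial output sizes with the usual formula, floor((n + 2p − k) / s) + 1. The caption does not state the padding, so the sketch below treats it as an assumption and shows two variants; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Stage 1 as listed in the caption: 7x7 convolution with stride 2,
# followed by 3x3 max pooling with stride 2, on a 100x100 input.
# ASSUMPTION: padding is not given in the caption, so both variants are shown.
after_conv_no_pad = conv_output_size(100, kernel=7, stride=2)                # 47
after_pool_no_pad = conv_output_size(after_conv_no_pad, kernel=3, stride=2)  # 23

after_conv_pad3 = conv_output_size(100, kernel=7, stride=2, padding=3)       # 50
after_pool_pad3 = conv_output_size(after_conv_pad3, kernel=3, stride=2)      # 24
```

In either variant the channel count after Stage 1 is 64, the number of filters given in the caption.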
Figure 4. RNN architecture.
  • One-hot vectors: Every character of each SMILES string is one-hot encoded and fed into the embedding layer.
  • Embedding layer: Maps the one-hot vectors into a 50-dimensional space.
  • 1D convolution layer: Each 1D convolution is performed using 92 filters, with sizes of 10, 5, and 3 respectively.
  • Flatten: A function that flattens the output of the 1D convolution.
  • Fully connected or dense: The fully…
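
The character-level one-hot step named in the Figure 4 caption might look like the sketch below. The vocabulary construction, the helper names `build_vocabulary` and `one_hot_encode`, and the two example strings are assumptions for illustration; the paper's actual preprocessing (padding, maximum length, handling of multi-character atoms such as Cl) is not described in the caption.

```python
def build_vocabulary(smiles_list):
    """Map each distinct character to an integer index, in sorted order."""
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(smiles, vocabulary):
    """Return one one-hot row per character of the SMILES string."""
    rows = []
    for ch in smiles:
        row = [0] * len(vocabulary)
        row[vocabulary[ch]] = 1
        rows.append(row)
    return rows


# Two small, well-known SMILES strings: ethanol and benzene.
examples = ["CCO", "c1ccccc1"]
vocab = build_vocabulary(examples)      # {'1': 0, 'C': 1, 'O': 2, 'c': 3}
encoded = one_hot_encode("CCO", vocab)  # three rows, one per character
```

Each row has exactly one nonzero entry, so the encoded string has shape (length, vocabulary size).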
Original abstract

Prediction of toxicity levels of chemical compounds is an important issue in Quantitative Structure-Activity Relationship (QSAR) modeling. Although toxicity prediction has achieved significant progress in recent times through deep learning, prediction accuracy levels obtained by even very recent methods are not yet very high. We propose a multimodal deep learning method using multiple heterogeneous neural network types and data representations. We represent chemical compounds by strings, images, and numerical features. We train fully connected, convolutional, and recurrent neural networks and their ensembles. Each data representation or neural network type has its own strengths and weaknesses. Our motivation is to obtain a collective performance that could go beyond individual performance of each data representation or each neural network type. On a standard toxicity benchmark, our proposed method obtains significantly better accuracy levels than that by the state-of-the-art toxicity prediction methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a multimodal deep learning approach for toxicity prediction in QSAR modeling. Chemical compounds are represented as strings, images, and numerical features; fully connected, convolutional, and recurrent networks are trained individually and in ensembles. The central claim is that this combination yields significantly higher accuracy on a standard toxicity benchmark than existing state-of-the-art methods.

Significance. If the reported gains are reproducible and demonstrably attributable to the multimodal design rather than implementation choices, the work would provide a practical demonstration that heterogeneous representations and architectures can be combined to improve toxicity prediction. The approach is directly applicable to chemical safety assessment and could serve as a template for other multimodal QSAR tasks.

major comments (2)
  1. [Abstract] Abstract: the claim that the method 'obtains significantly better accuracy levels than that by the state-of-the-art toxicity prediction methods' is stated without any numerical values, baseline scores, error bars, dataset sizes, or statistical tests. This absence prevents verification of the magnitude or reliability of the improvement.
  2. [Results] Results section (and any associated tables/figures): no modality-ablation experiments, single-network versus ensemble comparisons, or variance estimates across random seeds or data splits are referenced. Without these, it is impossible to attribute performance gains specifically to the multimodal fusion of string/image/numerical inputs and FC/CNN/RNN architectures rather than to hyperparameter tuning or baseline re-implementation details.
minor comments (1)
  1. [Abstract] The abstract and introduction should explicitly name the benchmark dataset and the precise evaluation metric (e.g., AUC, accuracy) used for the 'significantly better' claim.
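
The modality ablation requested in major comment 2 could be organized as below: score the ensemble built from every non-empty subset of representations and compare. This is a sketch, not the authors' protocol. It assumes an unweighted-mean ensemble and mean absolute error as the metric (the summary names neither), and `ablation_table` and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def ablation_table(y_true, predictions_by_modality):
    """Score the unweighted-mean ensemble of every non-empty modality subset."""
    names = sorted(predictions_by_modality)
    table = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            columns = [predictions_by_modality[name] for name in subset]
            averaged = [fmean(values) for values in zip(*columns)]
            table[subset] = mean_absolute_error(y_true, averaged)
    return table


# Toy, made-up targets and predictions (not from the paper).
y_true = [1.0, 2.0, 3.0]
toy = {
    "string": [1.5, 2.5, 3.5],
    "image": [0.5, 1.5, 2.5],
    "numeric": [1.0, 2.0, 4.0],
}
table = ablation_table(y_true, toy)
```

If a single-modality row matched the full ensemble on the real benchmark, the gains could not be attributed to the multimodal fusion.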

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our results. We address each major comment below and agree that revisions are warranted to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the method 'obtains significantly better accuracy levels than that by the state-of-the-art toxicity prediction methods' is stated without any numerical values, baseline scores, error bars, dataset sizes, or statistical tests. This absence prevents verification of the magnitude or reliability of the improvement.

    Authors: We agree that the abstract would be more informative with explicit metrics. In the revised manuscript we will expand the abstract to report the key accuracy figures achieved by our method, the corresponding state-of-the-art baselines, dataset size, and any statistical significance information that can be derived from the experiments already performed. revision: yes

  2. Referee: [Results] Results section (and any associated tables/figures): no modality-ablation experiments, single-network versus ensemble comparisons, or variance estimates across random seeds or data splits are referenced. Without these, it is impossible to attribute performance gains specifically to the multimodal fusion of string/image/numerical inputs and FC/CNN/RNN architectures rather than to hyperparameter tuning or baseline re-implementation details.

    Authors: We acknowledge that the original submission does not contain explicit modality-ablation studies, single-network versus ensemble breakdowns, or variance estimates across seeds or splits. While the manuscript presents the overall ensemble performance and comparisons against published state-of-the-art methods, these additional controls would strengthen attribution of the gains. We will therefore add the requested ablation tables, single-model results, and variance estimates (computed over multiple random seeds and data splits) in the revised version. revision: yes
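
The variance estimates mentioned in this response could be reported as a mean and sample standard deviation over repeated runs. A minimal sketch, assuming one scalar score per random seed; `summarize_runs` and the scores are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Toy, made-up scores from five seeds (not from the paper).
scores_by_seed = {0: 0.80, 1: 0.82, 2: 0.78, 3: 0.81, 4: 0.79}
avg, spread = summarize_runs(list(scores_by_seed.values()))
report = f"{avg:.3f} ± {spread:.3f} over {len(scores_by_seed)} seeds"
```

Reporting the spread next to each baseline's score shows whether a claimed gain exceeds run-to-run noise.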

Circularity Check

0 steps flagged

No circularity in empirical performance claim

Full rationale

This is an empirical machine-learning paper with no mathematical derivation, equations, or self-referential constructions. The central claim is an experimental result (better accuracy on a toxicity benchmark) obtained by training standard neural network ensembles on string/image/numerical representations. No steps reduce predictions to fitted inputs by construction, import uniqueness via self-citation, or smuggle ansatzes; the method description and benchmark comparison are self-contained against external data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions about generalization from a benchmark dataset and the untested premise that heterogeneous models will combine constructively; no new entities are postulated.

free parameters (1)
  • neural network hyperparameters and ensemble weights
    Training multiple neural networks on different representations requires many hyperparameters and combination rules that are fitted to the benchmark data.
axioms (2)
  • [domain assumption] The chosen standard toxicity benchmark is representative of real-world toxicity prediction performance.
    All accuracy claims are evaluated only on this benchmark.
  • [ad hoc to paper] Combining outputs from heterogeneous networks yields performance beyond any single component without introducing new failure modes.
    This is the explicit motivation stated in the abstract but is not demonstrated with ablations.
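
One way to keep the ensemble-weight free parameter from being fitted to the benchmark test set is to choose it on a held-out validation split. The sketch below assumes two models, a single convex weight, and mean squared error; none of these choices comes from the paper, and `pick_weight` and the toy data are hypothetical.

```python
from statistics import fmean


def mean_squared_error(y_true, y_pred):
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def blend(pred_a, pred_b, weight_a):
    """Convex combination of two models' predictions."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def pick_weight(y_val, val_a, val_b, steps=10):
    """Grid-search weight_a in [0, 1] on validation data only."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: mean_squared_error(y_val, blend(val_a, val_b, w)))


# Toy, made-up validation data (not from the paper).
y_val = [1.0, 2.0, 3.0]
val_a = [2.0, 3.0, 4.0]  # model A overshoots by 1
val_b = [0.0, 1.0, 2.0]  # model B undershoots by 1
best_weight = pick_weight(y_val, val_a, val_b)
```

The chosen weight is then frozen before the test set is scored, so the reported accuracy does not depend on a parameter tuned against it.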



Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1] Bjerrum, E.J.: SMILES enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 (2017)

  2. [2] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  3. [3] Dietterich, T.G., et al.: Ensemble learning. The handbook of brain theory and neural networks 2, 110–125 (2002)

  4. [4] Frankel, J.: Cell biology of Tetrahymena thermophila. In: Methods in cell biology, vol. 62, pp. 27–125. Elsevier (1999)

  5. [5] Goh, G.B., Hodas, N., Siegel, C., Vishnu, A.: SMILES2vec: Predicting chemical properties from text representations. In: Workshop track, International Conference on Learning Representations (2018)

  6. [6] Goh, G.B., Siegel, C., Vishnu, A., Hodas, N., Baker, N.: How much chemistry does a deep neural network need to know to make accurate predictions? In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1340–1349. IEEE (2018)

  7. [7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  8. [8] Hill, D.G.: The biochemistry and physiology of Tetrahymena. Elsevier (2012)

  9. [9] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)

  10. [10] Karim, A., Mishra, A., Newton, M.H., Sattar, A.: Efficient toxicity prediction via simple features using shallow neural networks and decision trees. ACS Omega 4(1), 1874–1888 (2019)

  11. [11] Kato, Y., Hamada, S., Goto, H.: Molecular activity prediction using deep learning software library. In: 2016 International Conference On Advanced Informatics: Concepts, Theory And Application (ICAICTA). pp. 1–6. IEEE (2016)

  12. [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. [13] Landrum, G.: RDKit documentation. Release 1, 1–79 (2013)

  14. [14] Lima, A.N., Philot, E.A., Trossini, G.H.G., Scott, L.P.B., Maltarollo, V.G., Honorio, K.M.: Use of machine learning approaches for novel drug discovery. Expert opinion on drug discovery 11(3), 225–239 (2016)

  15. [15] McFarland, J.W.: Parabolic relation between drug potency and hydrophobicity. Journal of medicinal chemistry 13(6), 1192–1196 (1970)

  16. [16] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  17. [17] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural networks 61, 85–117 (2015)

  18. [18] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1–9 (2015)

  19. [19] Weininger, D.: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28(1), 31–36 (1988)

  20. [20] Wu, K., Wei, G.W.: Quantitative toxicity prediction using topology based multi-task deep neural networks. Journal of chemical information and modeling 58(2), 520–531 (2018)

  21. [21] Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., Pande, V.: MoleculeNet: a benchmark for molecular machine learning. Chemical science 9(2), 513–530 (2018)

  22. [22] Yang, H., Lou, C., Sun, L., Li, J., Cai, Y., Wang, Z., Li, W., Liu, G., Tang, Y.: admetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties. Bioinformatics (2018)

  23. [23] Yap, C.W.: PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of computational chemistry 32(7), 1466–1474 (2011)

  24. [24] Zhu, H., Tropsha, A., Fourches, D., Varnek, A., Papa, E., Gramatica, P., Oberg, T., Dao, P., Cherkasov, A., Tetko, I.V.: Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. Journal of chemical information and modeling 48(4), 766–784 (2008)