Data preparation A chemical data set containing SMILES representations was obtained from ZINC15[27], and 30 million chemicals were randomly extracted for training the ED model. Inspired by Le et al.[16], the chemicals were filtered by the following criteria: (1) only atoms from the organic subset are present, and (2) the number of heavy atoms is between 3 and 50. Salts were stripped and only the largest fragment was kept. A random SMILES variant of each chemical was generated using the SMILES enumeration procedure[28]. For the evaluation of the descriptors, the High Throughput Screening (HTS) assay data set was obtained from the EPA[29]. The data were processed with the R library ToxCast-tcpl[30]. The transcriptome profile data set of MCF7 was obtained from iLINCS[31]. SMILES representations of the chemicals in the transcriptome profile data were obtained from PubChem[32] using PubChemPy 1.0.4, a Python wrapper for the PubChem API.
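As an illustration, the filtering, salt stripping, and SMILES enumeration steps could be implemented roughly as in the following RDKit sketch; the organic atom set, function names, and fragment handling shown here are assumptions for the example, not the exact pipeline used in the study.

```python
from rdkit import Chem
import random

# Assumed organic atom subset; the exact set used in the study is not specified here.
ORGANIC_ATOMS = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}


def keep_largest_fragment(mol):
    # Strip salts by keeping only the fragment with the most heavy atoms.
    frags = Chem.GetMolFrags(mol, asMols=True)
    return max(frags, key=lambda f: f.GetNumHeavyAtoms())


def passes_filter(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    mol = keep_largest_fragment(mol)
    # (1) only atoms from the organic subset, (2) 3 to 50 heavy atoms.
    if not all(atom.GetSymbol() in ORGANIC_ATOMS for atom in mol.GetAtoms()):
        return False
    return 3 <= mol.GetNumHeavyAtoms() <= 50


def random_smiles(smiles):
    # One random (non-canonical) SMILES variant, in the spirit of SMILES enumeration.
    mol = Chem.MolFromSmiles(smiles)
    order = list(range(mol.GetNumAtoms()))
    random.shuffle(order)
    mol = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(mol, canonical=False)
```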
Model preparation When creating the model, we referenced the model architecture developed by Winter et al.[10], while modifying the bucketing strategy in order to handle stereochemistry, which is removed in Winter's model (refer to the Availability section for details). The encoder network consists of a 3-layer Gated Recurrent Unit (GRU) with 256, 512, and 1024 units, respectively, followed by a fully connected layer with a hyperbolic tangent activation function that maps the concatenated cell states of the GRU to a latent space of 256 neurons. The decoder network takes the latent space as input and feeds it into a fully connected layer with 1792 neurons. This output is split into 3 parts, which are used to initialize the 3-layer GRU cells. The complete model is trained by minimizing the cross-entropy between the output of the decoder network, a sequence of probability distributions over the possible characters, and the one-hot encoded correct characters in the target sequence. For the decoder GRU we utilized teacher forcing[33]. To make the model robust, 15% dropout was applied and noise sampled from a zero-centered normal distribution with a standard deviation of 0.05 was added to the concatenated cell states of the encoder GRU.
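The PyTorch sketch below illustrates this architecture with the layer sizes given above (note that 1792 is the sum of the three GRU widths, 256 + 512 + 1024). Tokenization, bucketing, the training loop, and all class, variable, and parameter names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 40          # assumption: size of the SMILES character vocabulary
GRU_SIZES = [256, 512, 1024]
LATENT_DIM = 256


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, GRU_SIZES[0])
        # Three stacked GRUs with 256, 512, and 1024 units.
        self.grus = nn.ModuleList([
            nn.GRU(in_dim, out_dim, batch_first=True)
            for in_dim, out_dim in zip([GRU_SIZES[0]] + GRU_SIZES[:-1], GRU_SIZES)
        ])
        self.dropout = nn.Dropout(0.15)   # dropout placement is illustrative
        # Fully connected layer mapping the concatenated states (1792) to the latent space.
        self.to_latent = nn.Linear(sum(GRU_SIZES), LATENT_DIM)

    def forward(self, tokens, noise_std=0.05):
        x = self.embed(tokens)
        states = []
        for gru in self.grus:
            x, h = gru(self.dropout(x))
            states.append(h.squeeze(0))
        state = torch.cat(states, dim=-1)               # (batch, 1792)
        if self.training:                                # Gaussian noise on the concatenated states
            state = state + torch.randn_like(state) * noise_std
        return torch.tanh(self.to_latent(state))        # 256-dimensional latent vector / descriptor


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # The latent vector is expanded to 1792 units and split to initialize the 3 GRU layers.
        self.from_latent = nn.Linear(LATENT_DIM, sum(GRU_SIZES))
        self.embed = nn.Embedding(VOCAB_SIZE, GRU_SIZES[0])
        self.grus = nn.ModuleList([
            nn.GRU(in_dim, out_dim, batch_first=True)
            for in_dim, out_dim in zip([GRU_SIZES[0]] + GRU_SIZES[:-1], GRU_SIZES)
        ])
        self.out = nn.Linear(GRU_SIZES[-1], VOCAB_SIZE)

    def forward(self, latent, target_tokens):
        # Teacher forcing: the (shifted) target SMILES is fed as decoder input.
        h_init = torch.split(self.from_latent(latent), GRU_SIZES, dim=-1)
        x = self.embed(target_tokens)
        for gru, h in zip(self.grus, h_init):
            x, _ = gru(x, h.unsqueeze(0).contiguous())
        return self.out(x)    # character logits, trained with cross-entropy against the target
```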
The 30 million chemicals extracted from the ZINC data set were used as the training set. The model was trained to translate random SMILES into canonical SMILES. The Adam optimizer was used with a learning rate of \(5\times {10}^{-4}\), together with an exponential scheduler that decreases the learning rate by a factor of 0.9 every 250 epochs. The batch size was set to 1024, and sequences of different lengths were handled with the bucketing strategy described above. We trained models for 6, 13, 104, 260, 338, and 598 epochs to obtain models of varying accuracy. We used the framework PyTorch 1.8.0 to build and execute our proposed model. To evaluate model accuracy, we defined two indices, "perfect accuracy" and "partial accuracy", given by the following equations:
$$\text{perfect accuracy}=\frac{1}{n}\sum _{i}^{n}I\left(t=p\right)$$
$$\text{partial accuracy}=\frac{1}{n}\sum _{i}^{n}\left\{\frac{1}{\max \left(l\left(t\right),l\left(p\right)\right)}\sum _{j}^{\min \left(l\left(t\right),l\left(p\right)\right)}I\left({t}_{j}={p}_{j}\right)\right\}$$
where \(n\) is the number of chemicals in the evaluation set, \(t\) is the correct SMILES of the \(i\)th chemical, \(p\) is the corresponding predicted SMILES, \({t}_{j}\) is the \(j\)th letter of the correct SMILES, \(l\left(t\right)\) is the length of \(t\), and \(I\left(x\right)\) is the indicator function, which is 1 if the condition \(x\) holds and 0 otherwise.
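In code, the two indices reduce to simple character-level string comparisons; a minimal sketch in plain Python (with illustrative function names) is:

```python
def perfect_accuracy(targets, predictions):
    # Fraction of chemicals whose SMILES string is reproduced exactly.
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


def partial_accuracy(targets, predictions):
    # Average fraction of matching characters; positions are compared up to the
    # shorter string and normalized by the length of the longer string.
    scores = []
    for t, p in zip(targets, predictions):
        matches = sum(tc == pc for tc, pc in zip(t, p))
        scores.append(matches / max(len(t), len(p)))
    return sum(scores) / len(targets)
```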
The bottleneck layer of the ED model, with 256 dimensions, is regarded as the low-dimensional representation of a chemical and serves as its descriptor. Chemical descriptors are obtained by feeding SMILES to the encoder of the trained model.
Training HTS data ToxCast assay outcomes were predicted using the descriptors obtained from the encoders as inputs. We selected XGBoost as a representative machine learning method; hyperparameters were optimized with Optuna for each assay, and 3-fold cross validation was applied[34, 35]. The optimized hyperparameters and the conditions for optimization are provided in Additional file 1.
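A minimal sketch of the per-assay optimization, assuming Optuna, XGBoost, and scikit-learn cross validation, is shown below; the search space is illustrative and is not the one listed in Additional file 1.

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score


def make_objective(X, y):
    # X: encoder descriptors (n_chemicals x 256), y: binary assay outcomes.
    def objective(trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = xgb.XGBClassifier(**params)
        # 3-fold cross validation on the training portion of the assay data.
        return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return objective


# study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(X_train, y_train), n_trials=100)
```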
The HTS assays were filtered by the following criteria: (1) more than 7000 chemicals tested, and (2) a ratio of active chemicals to the total higher than 5%; this yielded 113 assays (listed in Additional file 2). For each assay, 25% of the data was split off as a test set and used to evaluate the trained model. We used two indices of model accuracy, AUROC and MCC.
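The held-out evaluation could then look roughly as in the following sketch, which uses scikit-learn's metric functions; the split seed and function name are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, matthews_corrcoef


def evaluate_assay(model, X, y):
    # 25% of the assay data is held out as the test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model.fit(X_train, y_train)
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mcc = matthews_corrcoef(y_test, model.predict(X_test))
    return auroc, mcc
```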
Visualization of chemical space To visualize the distribution of chemicals formed by the descriptors, dimensionality reduction was performed using UMAP[36] on the descriptors of the 292 CMap chemicals. To understand the differences between the chemical spaces obtained from models of different accuracy, the ECFP of each chemical was computed and the Tanimoto coefficient between every pair of ECFPs was calculated[37]. Based on the Tanimoto coefficient, we defined three groups of structurally similar chemicals: (1) coefficient with estradiol higher than 0.25 (estrogens), excluding fulvestrant because its long hydrocarbon chain makes it inappropriate as a structure similar to estradiol, (2) coefficient with apigenin higher than 0.25 (flavonoids), and (3) coefficient with isoconazole higher than 0.25 (azoles). A scatter plot of the chemicals embedded by UMAP was drawn, and the chemicals in the three groups were highlighted. The chemicals in each group are listed in Additional file 3.
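A minimal sketch of this step, assuming umap-learn and RDKit, is given below; the 0.25 threshold follows the text, while fingerprint parameters and names are illustrative.

```python
import umap
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs


def ecfp(smiles, radius=2, n_bits=2048):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), radius, nBits=n_bits)


def in_group(smiles, reference_smiles, threshold=0.25):
    # reference_smiles would be estradiol, apigenin, or isoconazole.
    return DataStructs.TanimotoSimilarity(ecfp(smiles), ecfp(reference_smiles)) > threshold


# embedding = umap.UMAP(n_components=2).fit_transform(descriptors)   # (292, 256) -> (292, 2)
```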
Investigation of substructure learning Using one of the 100K compound sets used for model evaluation, we extracted the compounds for which all models with accuracy higher than Model_0.1 generated valid structures. MACCS keys and 2048-bit ECFP were calculated for the input and generated structures of each compound, and their agreement was quantified by the Tanimoto coefficient[17].
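The agreement calculation can be sketched with RDKit as follows; fingerprint parameters follow the text, and the function name is illustrative.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys, DataStructs


def structural_agreement(input_smiles, generated_smiles):
    m_in = Chem.MolFromSmiles(input_smiles)
    m_out = Chem.MolFromSmiles(generated_smiles)
    # Tanimoto coefficient between MACCS keys of input and generated structures.
    maccs_agreement = DataStructs.TanimotoSimilarity(
        MACCSkeys.GenMACCSKeys(m_in), MACCSkeys.GenMACCSKeys(m_out)
    )
    # Tanimoto coefficient between 2048-bit ECFPs of input and generated structures.
    ecfp_agreement = DataStructs.TanimotoSimilarity(
        AllChem.GetMorganFingerprintAsBitVect(m_in, 2, nBits=2048),
        AllChem.GetMorganFingerprintAsBitVect(m_out, 2, nBits=2048),
    )
    return maccs_agreement, ecfp_agreement
```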
Investigation of structures incorrectly decoded Using one of the 100K compound sets used for model evaluation, the string length and molecular weight of the actual SMILES versus those of the predicted SMILES that were not correctly decoded were calculated for each model. These are shown in scatter plots, together with the line y = x (on which the value for the actual SMILES equals that for the predicted SMILES).
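A minimal matplotlib sketch of such a plot (scatter of predicted versus actual values with the identity line y = x; variable and function names are placeholders) is:

```python
import matplotlib.pyplot as plt


def plot_actual_vs_predicted(actual, predicted, label):
    # Scatter plot of predicted vs. actual values for incorrectly decoded SMILES.
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, s=5, alpha=0.5)
    lims = [min(min(actual), min(predicted)), max(max(actual), max(predicted))]
    ax.plot(lims, lims)  # y = x: predicted value equals actual value
    ax.set_xlabel(f"actual {label}")
    ax.set_ylabel(f"predicted {label}")
    return fig
```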