Data preparation A chemical data set containing SMILES representations was obtained from ZINC15[27], and 30 million chemicals were randomly extracted for training the ED model. Inspired by Le et al.[16], the chemicals were filtered by the following criteria: (1) only atoms from the organic subset are present, and (2) the number of heavy atoms is between 3 and 50. Salts were stripped and only the largest fragment was kept. A random SMILES variant of each chemical was generated using the SMILES enumeration procedure[28]. For the evaluation of the descriptors, the High Throughput Screening (HTS) assay data set was obtained from the EPA[29]. The data were processed with the R library ToxCast-tcpl[30]. The transcriptome profile data set of MCF7 was obtained from iLINCS[31]. SMILES representations of the chemicals in the transcriptome profile data were obtained from PubChem[32] using PubChemPy 1.0.4, a Python wrapper for the PubChem API.
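As an illustration, the filtering, salt stripping, and SMILES enumeration steps could be implemented roughly as in the following RDKit sketch; the organic atom set, function names, and fragment handling shown here are assumptions for the example, not the exact pipeline used in the study.

```python
from rdkit import Chem
import random

# Assumed organic atom subset; the exact set used in the study is not specified here.
ORGANIC_ATOMS = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}


def keep_largest_fragment(mol):
    # Strip salts by keeping only the fragment with the most heavy atoms.
    frags = Chem.GetMolFrags(mol, asMols=True)
    return max(frags, key=lambda f: f.GetNumHeavyAtoms())


def passes_filter(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    mol = keep_largest_fragment(mol)
    # (1) only atoms from the organic subset, (2) 3 to 50 heavy atoms.
    if not all(atom.GetSymbol() in ORGANIC_ATOMS for atom in mol.GetAtoms()):
        return False
    return 3 <= mol.GetNumHeavyAtoms() <= 50


def random_smiles(smiles):
    # One random (non-canonical) SMILES variant, in the spirit of SMILES enumeration.
    mol = Chem.MolFromSmiles(smiles)
    order = list(range(mol.GetNumAtoms()))
    random.shuffle(order)
    mol = Chem.RenumberAtoms(mol, order)
    return Chem.MolToSmiles(mol, canonical=False)
```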
Model preparation When creating the model, we referenced the model architecture developed by Winter et al.[10], while modifying the bucketing strategy in order to handle stereochemistry, which is removed in Winter's model (refer to the Availability section for details). The encoder network consists of a 3-layer Gated Recurrent Unit (GRU) with 256, 512, and 1024 units, respectively, followed by a fully connected layer with a hyperbolic tangent activation function that maps the concatenated cell states of the GRU to a latent space of 256 neurons. The decoder network takes the latent space as input and feeds it into a fully connected layer with 1792 neurons. This output is split into 3 parts, which are used to initialize the 3-layer GRU cells. The complete model is trained by minimizing the cross-entropy between the output of the decoder network, a sequence of probability distributions over the possible characters, and the one-hot encoded correct characters in the target sequence. For the decoder GRU we utilized teacher forcing[33]. To make the model robust, 15% dropout was applied and noise sampled from a zero-centered normal distribution with a standard deviation of 0.05 was added to the concatenated cell states of the encoder GRU.
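The PyTorch sketch below illustrates this architecture with the layer sizes given above (note that 1792 is the sum of the three GRU widths, 256 + 512 + 1024). Tokenization, bucketing, the training loop, and all class, variable, and parameter names are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 40          # assumption: size of the SMILES character vocabulary
GRU_SIZES = [256, 512, 1024]
LATENT_DIM = 256


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, GRU_SIZES[0])
        # Three stacked GRUs with 256, 512, and 1024 units.
        self.grus = nn.ModuleList([
            nn.GRU(in_dim, out_dim, batch_first=True)
            for in_dim, out_dim in zip([GRU_SIZES[0]] + GRU_SIZES[:-1], GRU_SIZES)
        ])
        self.dropout = nn.Dropout(0.15)   # dropout placement is illustrative
        # Fully connected layer mapping the concatenated states (1792) to the latent space.
        self.to_latent = nn.Linear(sum(GRU_SIZES), LATENT_DIM)

    def forward(self, tokens, noise_std=0.05):
        x = self.embed(tokens)
        states = []
        for gru in self.grus:
            x, h = gru(self.dropout(x))
            states.append(h.squeeze(0))
        state = torch.cat(states, dim=-1)               # (batch, 1792)
        if self.training:                                # Gaussian noise on the concatenated states
            state = state + torch.randn_like(state) * noise_std
        return torch.tanh(self.to_latent(state))        # 256-dimensional latent vector / descriptor


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        # The latent vector is expanded to 1792 units and split to initialize the 3 GRU layers.
        self.from_latent = nn.Linear(LATENT_DIM, sum(GRU_SIZES))
        self.embed = nn.Embedding(VOCAB_SIZE, GRU_SIZES[0])
        self.grus = nn.ModuleList([
            nn.GRU(in_dim, out_dim, batch_first=True)
            for in_dim, out_dim in zip([GRU_SIZES[0]] + GRU_SIZES[:-1], GRU_SIZES)
        ])
        self.out = nn.Linear(GRU_SIZES[-1], VOCAB_SIZE)

    def forward(self, latent, target_tokens):
        # Teacher forcing: the (shifted) target SMILES is fed as decoder input.
        h_init = torch.split(self.from_latent(latent), GRU_SIZES, dim=-1)
        x = self.embed(target_tokens)
        for gru, h in zip(self.grus, h_init):
            x, _ = gru(x, h.unsqueeze(0).contiguous())
        return self.out(x)    # character logits, trained with cross-entropy against the target
```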
The 30 million chemicals extracted from the ZINC data set were used as the training set. The model was trained to translate random SMILES into canonical SMILES. The Adam optimizer was used with a learning rate of \(5\times {10}^{-4}\), together with an exponential scheduler that decreases the learning rate by a factor of 0.9 every 250 epochs. The batch size was set to 1024, and sequences of different lengths were handled with the bucketing strategy described above. We trained models for 6, 13, 104, 260, 338, and 598 epochs to obtain models of varying accuracy. We used the framework PyTorch 1.8.0 to build and execute our proposed model. To evaluate model accuracy, we defined two indices, "perfect accuracy" and "partial accuracy", given by the following equations:
$$\text{perfect accuracy}=\frac{1}{n}\sum _{i}^{n}I\left(t=p\right)$$
$$\text{partial accuracy}=\frac{1}{n}\sum _{i}^{n}\left\{\frac{1}{\max \left(l\left(t\right),l\left(p\right)\right)}\sum _{j}^{\min \left(l\left(t\right),l\left(p\right)\right)}I\left({t}_{j}={p}_{j}\right)\right\}$$
where \(n\) is the number of chemicals in the evaluation set, \(t\) is the correct SMILES of the \(i\)th chemical, \(p\) is the corresponding predicted SMILES, \({t}_{j}\) is the \(j\)th letter of the correct SMILES, \(l\left(t\right)\) is the length of \(t\), and \(I\left(x\right)\) is the indicator function, which is 1 if the condition \(x\) holds and 0 otherwise.
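In code, the two indices reduce to simple character-level string comparisons; a minimal sketch in plain Python (with illustrative function names) is:

```python
def perfect_accuracy(targets, predictions):
    # Fraction of chemicals whose SMILES string is reproduced exactly.
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


def partial_accuracy(targets, predictions):
    # Average fraction of matching characters; positions are compared up to the
    # shorter string and normalized by the length of the longer string.
    scores = []
    for t, p in zip(targets, predictions):
        matches = sum(tc == pc for tc, pc in zip(t, p))
        scores.append(matches / max(len(t), len(p)))
    return sum(scores) / len(targets)
```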
The bottleneck layer of the ED model, with 256 dimensions, is regarded as the low-dimensional representation of a chemical and serves as its descriptor. Chemical descriptors are obtained by feeding SMILES to the encoder of the trained model.
Training HTS data ToxCast assay outcomes were predicted using the descriptors obtained from the encoders as inputs. We selected XGBoost as a representative machine learning method; hyperparameters were optimized with Optuna for each assay, and 3-fold cross validation was applied[34, 35]. The optimized hyperparameters and the conditions for optimization are provided in Additional file 1.
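A minimal sketch of the per-assay optimization, assuming Optuna, XGBoost, and scikit-learn cross validation, is shown below; the search space is illustrative and is not the one listed in Additional file 1.

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score


def make_objective(X, y):
    # X: encoder descriptors (n_chemicals x 256), y: binary assay outcomes.
    def objective(trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        }
        model = xgb.XGBClassifier(**params)
        # 3-fold cross validation on the training portion of the assay data.
        return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return objective


# study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(X_train, y_train), n_trials=100)
```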
The HTS assays were filtered by the following criteria: (1) more than 7000 chemicals tested, and (2) a ratio of active chemicals to the total higher than 5%; this yielded 113 assays (listed in Additional file 2). For each assay, 25% of the data was split off as a test set and used to evaluate the trained model. We used two indices of model accuracy, AUROC and MCC.
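The held-out evaluation could then look roughly as in the following sketch, which uses scikit-learn's metric functions; the split seed and function name are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, matthews_corrcoef


def evaluate_assay(model, X, y):
    # 25% of the assay data is held out as the test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model.fit(X_train, y_train)
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mcc = matthews_corrcoef(y_test, model.predict(X_test))
    return auroc, mcc
```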
Visualization of chemical space To visualize the distribution of chemicals formed by the descriptors, dimensionality reduction was performed using UMAP[36] on the descriptors of the 292 CMap chemicals. To understand the differences between the chemical spaces obtained from models of different accuracy, the ECFP of each chemical was computed and the Tanimoto coefficient between every pair of ECFPs was calculated[37]. Based on the Tanimoto coefficient, we defined three groups of structurally similar chemicals: (1) coefficient with estradiol higher than 0.25 (estrogens), excluding fulvestrant because its long hydrocarbon chain makes it inappropriate as a structure similar to estradiol, (2) coefficient with apigenin higher than 0.25 (flavonoids), and (3) coefficient with isoconazole higher than 0.25 (azoles). A scatter plot of the chemicals embedded by UMAP was drawn, and the chemicals in the three groups were highlighted. The chemicals in each group are listed in Additional file 3.
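A minimal sketch of this step, assuming umap-learn and RDKit, is given below; the 0.25 threshold follows the text, while fingerprint parameters and names are illustrative.

```python
import umap
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs


def ecfp(smiles, radius=2, n_bits=2048):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), radius, nBits=n_bits)


def in_group(smiles, reference_smiles, threshold=0.25):
    # reference_smiles would be estradiol, apigenin, or isoconazole.
    return DataStructs.TanimotoSimilarity(ecfp(smiles), ecfp(reference_smiles)) > threshold


# embedding = umap.UMAP(n_components=2).fit_transform(descriptors)   # (292, 256) -> (292, 2)
```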
Investigation of substructure learning Using one of the 100K compound sets used for model evaluation, we extracted the compounds for which all models with accuracy higher than Model_0.1 generated valid structures. MACCS keys and 2048-bit ECFP were calculated for the input and generated structures of each compound, and their agreement was quantified by the Tanimoto coefficient[17].
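The agreement calculation can be sketched with RDKit as follows; fingerprint parameters follow the text, and the function name is illustrative.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys, DataStructs


def structural_agreement(input_smiles, generated_smiles):
    m_in = Chem.MolFromSmiles(input_smiles)
    m_out = Chem.MolFromSmiles(generated_smiles)
    # Tanimoto coefficient between MACCS keys of input and generated structures.
    maccs_agreement = DataStructs.TanimotoSimilarity(
        MACCSkeys.GenMACCSKeys(m_in), MACCSkeys.GenMACCSKeys(m_out)
    )
    # Tanimoto coefficient between 2048-bit ECFPs of input and generated structures.
    ecfp_agreement = DataStructs.TanimotoSimilarity(
        AllChem.GetMorganFingerprintAsBitVect(m_in, 2, nBits=2048),
        AllChem.GetMorganFingerprintAsBitVect(m_out, 2, nBits=2048),
    )
    return maccs_agreement, ecfp_agreement
```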
Investigation of structures incorrectly decoded Using one of the 100K compound sets used for model evaluation, the string length and molecular weight of the actual SMILES versus those of the predicted SMILES that were not correctly decoded were calculated for each model. These are shown in scatter plots, together with the line y = x (on which the value for the actual SMILES equals that for the predicted SMILES).
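A minimal matplotlib sketch of such a plot (scatter of predicted versus actual values with the identity line y = x; variable and function names are placeholders) is:

```python
import matplotlib.pyplot as plt


def plot_actual_vs_predicted(actual, predicted, label):
    # Scatter plot of predicted vs. actual values for incorrectly decoded SMILES.
    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, s=5, alpha=0.5)
    lims = [min(min(actual), min(predicted)), max(max(actual), max(predicted))]
    ax.plot(lims, lims)  # y = x: predicted value equals actual value
    ax.set_xlabel(f"actual {label}")
    ax.set_ylabel(f"predicted {label}")
    return fig
```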