Virtual screening by a new Clustering-based Weighted Similarity Extreme Learning Machine approach

  • Kitsuchart Pasupa ,

    Roles Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    kitsuchart@it.kmitl.ac.th

    Affiliation Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand

  • Wasu Kudisthalert

    Roles Software, Visualization, Writing – original draft

    Affiliation Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand

Abstract

Machine learning techniques are becoming popular in virtual screening tasks. One powerful machine learning algorithm is the Extreme Learning Machine (ELM), which has been used in many applications and has recently been applied to virtual screening. We propose the Weighted Similarity ELM (WS-ELM), which is based on a single-layer feed-forward neural network used in conjunction with 16 different similarity coefficients as activation functions in the hidden layer. The performance of conventional ELM is known not to be robust because of random weight selection in the hidden layer. We therefore propose a Clustering-based WS-ELM (CWS-ELM) that assigns weights deterministically by means of clustering algorithms, i.e. k-means clustering and support vector clustering. The experiments were conducted on one of the most challenging datasets, the Maximum Unbiased Validation Dataset, which contains 17 activity classes carefully selected from PubChem. The proposed algorithms were compared with other machine learning techniques, such as support vector machine and random forest, and with similarity searching. The results show that CWS-ELM in conjunction with support vector clustering yields the best performance when used together with the Sokal/Sneath(1) coefficient. Furthermore, the ECFP_6 fingerprint gives the best results in our framework compared with the other fingerprint types, namely ECFP_4, FCFP_4, and FCFP_6.

Introduction

Drug screening is the process of identifying drug candidates that are active against relevant biological targets. Computers are now used to speed up this process and so reduce the time required to bring drugs to market. Computational screening can also save millions of dollars compared with testing in vitro. Virtual screening is a set of computational techniques which aims to rank molecular structures in a database [1]. This ensures that chemists can first assay the molecules that have a higher probability of being active against the relevant biological target. A conventional technique in virtual screening is called “similarity searching”. It ranks all molecules in a database on the basis of their similarity or dissimilarity to a query molecule.

Machine learning techniques are becoming popular in many applications today. They also play an important role in the drug discovery process, e.g. prediction of target structures and optimisation of hit compounds. Examples of techniques used in the drug discovery process are support vector machine (SVM) [2-4], binary discriminant analysis [2, 5], artificial neural networks [6], and decision trees [7]. Many techniques used in virtual screening have been well documented and reviewed [8-10]. Among these techniques, SVM is one of the most powerful and popular in this area, with an increasing number of publications in recent decades [10].

Although SVM is a powerful algorithm, its main drawback is that training requires solving a quadratic programming problem, whose space complexity is at least quadratic in the number of training samples. When the training dataset becomes large, the computational cost becomes very high. In addition, SVM requires two or more user-specified parameters which directly affect the model’s performance. These parameters must be tuned in order to obtain an optimal model, so the more parameters there are to tune, the higher the computational cost. In 2004, Huang et al. proposed the Extreme Learning Machine (ELM), which uses a single-hidden-layer feed-forward neural network [11]. The algorithm is fast and able to obtain the optimal solution. It has proved to be competitive with SVM in performance while training remarkably faster. Moreover, ELM requires less human intervention than SVM because the only important parameter is the number of hidden nodes [12, 13]. ELM has been applied to protein sequence classification [14-16]. To the best of our knowledge, ELM was first applied to the virtual screening task by [17] as the Weighted Tanimoto ELM (WELMJT). The algorithm is customised for 2D binary fingerprint descriptors. WELMJT replaces the activation function of the neurons in the hidden layer with the Jaccard/Tanimoto (JT) similarity coefficient. Moreover, instead of drawing hidden nodes at random from a continuous distribution as in the conventional ELM, WELMJT randomly selects hidden nodes from the training set. Since many similarity coefficients are available, we adopt a weighted similarity ELM (WS-ELM) algorithm which employs different similarity coefficients, in order to find a suitable similarity coefficient for the virtual screening task with 2D fingerprint descriptors.

In addition, the performance of WS-ELM, like that of ELM, is not robust because of random weight selection. To address this problem, we consider a deterministic assignment of hidden weights, which should increase the robustness of the conventional ELM. We propose an approach that carefully selects the weights of WS-ELM by employing clustering techniques to choose representative candidates for the weights. The proposed algorithm, called “Clustering-based Weighted Similarity ELM” (CWS-ELM), is compared with conventional techniques in a well-designed experimental framework on one of the most challenging databases, the Maximum Unbiased Validation Dataset, which consists of 17 activity classes.

Methods

In this section, we explain all methods used in this work together with our proposed techniques.

Similarity searching

Similarity searching is a technique for finding compounds in a database which are structurally similar to a query compound. It compares the query against every compound in the database and returns the database ranked by similarity score. Its rationale is that the more similar the structures of two molecules are, the higher the chance that they have the same properties. The degree of similarity can be calculated with a similarity coefficient. Many coefficients have been introduced and re-introduced, as they are in very common use in many applications [5, 18, 19].

In this paper, we investigate 16 coefficients selected from [5, 20, 21] as shown in Table 1. Some coefficients are excluded, e.g. Dice: it is monotonic with Jaccard/Tanimoto, so the two give identical rankings. The similarity s(xi, xj) and dissimilarity d(xi, xj) of two molecules are usually calculated from four quantities: (i) a: the number of bits set in both molecules i and j, (ii) b: the number of bits set in molecule i and unset in molecule j, (iii) c: the number of bits set in molecule j and unset in molecule i, and (iv) d: the number of bits unset in both molecules i and j. The sum of these four quantities (a + b + c + d) equals the number of bits m in the fingerprints of molecules i and j. The coefficients are divided into three main groups as follows:

  • Association coefficients are based upon the inner product operation. Most of them range over [0, 1], where 0 indicates no similarity and 1 complete similarity.
  • Correlation coefficients measure the degree of correlation between the molecules.
  • Distance coefficients quantify the degree of difference between two objects. The more similar two objects are, the smaller the distance value is. A distance function can be converted to a similarity function by d(xi, xj) = 1 − s(xi, xj).
Table 1. Formulas for similarity/dissimilarity coefficients for binary-valued vectors.

https://doi.org/10.1371/journal.pone.0195478.t001
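The four quantities a, b, c, and d defined above can be illustrated with a short Python sketch. It is not taken from the authors' Matlab code. The formulas for Jaccard/Tanimoto, Dice, and Sokal/Sneath(1) are the standard textbook definitions and are assumed to agree with Table 1; the function names and the convention of returning 0 for two all-zero fingerprints are assumptions of this sketch.

```python
def bit_counts(x, y):
    """Return (a, b, c, d) for two equal-length binary fingerprints."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for p, q in zip(x, y) if p and q)        # set in both
    b = sum(1 for p, q in zip(x, y) if p and not q)    # set in x only
    c = sum(1 for p, q in zip(x, y) if q and not p)    # set in y only
    d = len(x) - a - b - c                             # unset in both
    return a, b, c, d


def jaccard_tanimoto(x, y):
    a, b, c, _ = bit_counts(x, y)
    # Assumed convention: two empty fingerprints have similarity 0.
    return a / (a + b + c) if (a + b + c) else 0.0


def dice(x, y):
    a, b, c, _ = bit_counts(x, y)
    return 2 * a / (2 * a + b + c) if (a + b + c) else 0.0


def sokal_sneath_1(x, y):
    a, b, c, _ = bit_counts(x, y)
    # Mismatches (b and c) carry double weight compared with Jaccard/Tanimoto.
    return a / (a + 2 * b + 2 * c) if (a + b + c) else 0.0


x = [1, 1, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0, 1, 0]
jt = jaccard_tanimoto(x, y)
print(bit_counts(x, y), jt, dice(x, y), sokal_sneath_1(x, y))
```

For any pair of fingerprints, Dice equals 2·JT / (1 + JT), which is a strictly increasing function of JT. This is why the two coefficients always produce the same ranking and Dice can be left out.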

If multiple active molecules (nA) are available, we can calculate the similarity value between a molecule xj in the unranked database and a set of query molecules–for all xi ∈ Actives by, (1)

Extreme Learning Machine

Extreme Learning Machine (ELM) was first proposed by Huang et al. [11]. It is based on a single-layer feed-forward neural network architecture. Consider the matrix of m-dimensional sample vectors $X = [x_1, x_2, \ldots, x_n]^T$ and a target vector $y$ comprising $y_i \in \{-1, +1\}$. There are $l$ nodes in the hidden layer. The output of ELM is a linear sum of the hidden layer outputs, weighted by $\beta_i$, the weights connecting the hidden neurons to the output. Each hidden layer output applies an activation function $g(\cdot)$ to a linear combination of the input $x$ with synaptic weights $w_i$ and bias $b_i$, which connect the hidden neuron to the input neurons. Therefore the model can be defined as:

$$f(x) = \sum_{i=1}^{l} \beta_i \, g(w_i \cdot x + b_i) \qquad (2)$$

where $w_i = [w_{i1}, \ldots, w_{im}]$ (randomly generated). Therefore, the activity of the hidden nodes can be represented as

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_l \cdot x_1 + b_l) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_n + b_1) & \cdots & g(w_l \cdot x_n + b_l) \end{bmatrix} \qquad (3)$$

The ELM aims to minimise the mean squared error,

$$\min_{\beta} \; \frac{1}{n} \lVert \hat{y} - y \rVert_2^2 \qquad (4)$$

where $\hat{y} = H\beta$ is a predicted target. Thus, the Moore-Penrose pseudo-inverse $H^{\dagger}$ is employed to achieve the optimal solution for this problem. Hence, $\beta$ can be defined by

$$\beta = H^{\dagger} y \qquad (5)$$

The prediction score can be computed from

$$\hat{y} = H\beta \qquad (6)$$

Weighted Similarity Extreme Learning Machine

The proposed Weighted Similarity ELM (WS-ELM) consists of two functions which are (i) empirical likelihood function–mean squared error–and (ii) penalised likelihood functions–ridge penalty, (7) The activation function g(⋅) in the conventional ELM is replaced by s(⋅, ⋅), hence, the H is represented as, (8) C is a regularisation parameter to control the complexity of the model. w is randomly selected from the training set w ⊂ X instead of randomly selected from a continuous distribution. This is to ensure that the achieved weights are binary, sparse, and have identical dimension span.

The virtual screening task faces a dramatic imbalance between the number of active (nA) and inactive (nI) molecules. In order to deal with this imbalanced class problem, a diagonal Γn×n is defined associated with all training samples. A minority class will be given higher importance than a majority class. Thus, the likelihood function becomes (9) (10) The above likelihood function can be minimised using standard ℓ2-regularised weighted least squares which gives the following solution (11) Instead of calculating HTΓH, we can calculate (γ ⋅ H)T(γ ⋅ H), where . This technique can speed up the computational time [17]. This leads to the solution in Eq 12.

(12)

where . γi can be defined as, (13) The architecture of WS-ELM is shown in Fig 1.
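The training step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is taken to be ½‖β‖² + (C/2)·Σ Γii (hi·β − yi)², so that β = (HᵀΓH + I/C)⁻¹ HᵀΓy, and each sample weight Γii is assumed to be the reciprocal of its class size. The exact forms of Eqs 7 to 13 may differ. The helper names, the toy data, and the use of Jaccard/Tanimoto as s(⋅, ⋅) are likewise illustrative.

```python
import math


def tanimoto(x, y):
    a = sum(1 for p, q in zip(x, y) if p and q)
    union = sum(1 for p, q in zip(x, y) if p or q)
    return a / union if union else 0.0


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def hidden_layer(X, W, sim=tanimoto):
    # H[i][j] = s(x_i, w_j): similarity of sample i to hidden node j.
    return [[sim(x, w) for w in W] for x in X]


def train_ws_elm(X, y, W, C=1.0, sim=tanimoto):
    n_act = sum(1 for t in y if t == 1)
    n_inact = len(y) - n_act
    # ASSUMPTION: gamma_i = sqrt(1 / size of the class of sample i).
    gamma = [math.sqrt(1.0 / (n_act if t == 1 else n_inact)) for t in y]
    H = hidden_layer(X, W, sim)
    # Scale rows by gamma so that Hs^T Hs equals H^T Gamma H.
    Hs = [[g * h for h in row] for g, row in zip(gamma, H)]
    ys = [g * t for g, t in zip(gamma, y)]
    l = len(W)
    A = [[sum(r[i] * r[j] for r in Hs) + (1.0 / C if i == j else 0.0)
          for j in range(l)] for i in range(l)]
    rhs = [sum(r[i] * t for r, t in zip(Hs, ys)) for i in range(l)]
    return solve(A, rhs)


def predict(X, W, beta, sim=tanimoto):
    return [sum(h * b for h, b in zip(row, beta))
            for row in hidden_layer(X, W, sim)]


# Toy data: two actives (+1) and four inactives (-1).
X = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 1, 1]]
y = [1, 1, -1, -1, -1, -1]
W = [X[0], X[2]]  # hidden nodes are training samples, fixed here for clarity
beta = train_ws_elm(X, y, W, C=100.0)
scores = predict(X, W, beta)
print(beta, scores)
```

Because the hidden nodes are themselves fingerprints, each β entry can be read as the vote of one reference molecule: similarity to an active-like node raises the score, and similarity to an inactive-like node lowers it.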

Clustering-based Weighted Similarity Extreme Learning Machine

Due to randomness of weights between input and hidden layers, the prediction of the conventional ELM is not stable. This is applicable to the case of WS-ELM as well because a subset of samples in the training set is randomly selected to represent the weights in WS-ELM. Therefore, a deterministic assignment of hidden weights will be able to improve the performance of the conventional ELM. In order to enable the deterministic approach to this, we utilise cluster analysis methods to organise and summarise data through group prototypes. Thus, we propose a new algorithm called “Clustering-based WS-ELM”(CWS-ELM).

Cluster analysis is an unsupervised learning technique for grouping samples in the space into k groups. It aims to minimise the distance between samples within each cluster while maximising the distance between groups. Many clustering algorithms have been introduced and are well documented [22, 23]. In this paper, we investigate the k-means clustering and support vector clustering algorithms. The rationale behind this selection is the way each represents the data in a group: a cluster can be represented either by its centroid, as identified by the k-means clustering algorithm, or by a set of samples bounding the cluster. Brief details of these two algorithms are given in the following subsections. The pseudo-code for CWS-ELM is shown in Algorithm 1.

k-means clustering.

This is the conventional clustering technique, which aims to minimise the Euclidean distance between the samples and the centroid of each cluster. The number of clusters (k) must be specified by the user. Instead of Euclidean distance, the other distance or similarity coefficients listed in Table 1 can be adopted as well. A centroid is generally not a binary vector, so to ensure that CWS-ELM picks a binary weight, we choose the sample that is closest to each centroid. Thus, the number of nodes used in CWS-ELM is equal to the number of centroids representing the clusters in the training data.
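The step of replacing each centroid by its nearest training sample can be sketched as follows. The sketch assumes that cluster labels have already been produced by some k-means run and that Euclidean distance is used; the function names and toy data are hypothetical.

```python
def centroid(cluster):
    """Per-bit mean of the fingerprints in one cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def closest_sample(cluster):
    """Return the member of the cluster nearest to its centroid."""
    c = centroid(cluster)

    def sq_dist(x):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

    return min(cluster, key=sq_dist)


def select_hidden_nodes(X, labels):
    """One binary prototype per cluster, ordered by cluster label."""
    clusters = {}
    for x, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(x)
    return [closest_sample(clusters[lab]) for lab in sorted(clusters)]


X = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
     [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
labels = [0, 0, 0, 1, 1, 1]  # assumed output of a k-means run with k = 2
W = select_hidden_nodes(X, labels)
print(W)
```

The centroid of the first cluster is [1, 2/3, 1/3, 0], which is not a valid fingerprint, whereas the selected prototype [1, 1, 0, 0] is an actual training sample and therefore binary.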

Support vector clustering.

Support Vector Clustering (SVC), introduced in [24], is inspired by the well-known Support Vector Machine algorithm. SVC employs the kernel trick to map all samples into a high-dimensional feature space and finds the smallest sphere that contains the mapped samples. When mapped back to the original feature space, the sphere forms a set of contours which enclose the samples. Samples enclosed by the same contour belong to the same cluster. Any points lying on the boundary of the sphere, i.e. the cluster boundary, are support vectors. Moreover, a soft margin allows the sphere to leave some points outside it, which gives the algorithm the ability to deal with outliers. The similarity functions in Table 1 can be adopted as kernel functions, similarly to [5]. In CWS-ELM, the number of nodes is equal to the number of support vectors bounding the clusters in the training data.

Algorithm 1 Clustering-based Weighted Similarity Extreme Learning Machine

function CWS-ELM_Train(X, y, method)
    switch method do
        case 0                          ▹ Conventional
            W ← Randomly select a subset of X
        case 1                          ▹ k-means clustering
            W ← Centroid of each cluster by k-means
        case 2                          ▹ SVC
            W ← Support vectors bounding each cluster by SVC
    n ← #samples
    nA ← #positive samples
    nI ← #negative samples
    for i ← 1 to n do
        if yi = 1 then
            γi ← weight of an active sample (Eq 13)
        else
            γi ← weight of an inactive sample (Eq 13)
        end if
    end for
    H ← S(X, W)                         ▹ Eq 8
    β ← solution of Eq 12
    return W, β
end function

function CWS-ELM_Predict(W, β, XTest)
    H ← S(XTest, W)
    t ← Hβ
    return t
end function

Dataset and experiment framework

Maximum Unbiased Validation Dataset

The experiments were conducted on a well-known, publicly available virtual screening dataset, the “Maximum Unbiased Validation” (MUV) dataset, which was created by the Institute of Pharmaceutical Chemistry, Braunschweig University of Technology, Germany [25]. It consists of 17 bioactivity data sets carefully selected from PubChem, an open archive of the biological activities of millions of molecules, as shown in Table 2. Each set consists of 30 active compounds together with 15,000 carefully selected confirmed inactive compounds (also known as decoys). An active compound is one that causes the corresponding biological activity, while an inactive compound does not. The active compounds in each activity class are designed to be structurally heterogeneous (sometimes called diverse), with an average of only 1.14 compounds per distinct scaffold in each activity class. The scaffold is the core structure which is the main component of a molecule. Moreover, the classes are grossly imbalanced, with over 99.8% of the compounds belonging to the inactive group. Therefore, this dataset is one of the most challenging in virtual screening.

Table 2. The 17 activity classes in the MUV dataset.

The entries are ranked in decreasing order of average mean pairwise similarity across four fingerprints.

https://doi.org/10.1371/journal.pone.0195478.t002

We represent the data with two popular fingerprints generated by the Pipeline Pilot software, namely the Extended Connectivity Fingerprint (ECFP) and the Functional-class Fingerprint (FCFP) [26]. We selected these two types because Gardiner et al. demonstrated that ECFP and FCFP were the two best-performing fingerprints among BCI [27], Daylight [28], ECFP, FCFP, MDL [26], and Unity [29] fingerprints in virtual screening tasks [30]. In this work, both ECFP and FCFP fingerprints use circular substructures with a diameter of four or six bonds, denoted ECFP_4, ECFP_6, FCFP_4, and FCFP_6. All four types of fingerprints have a fixed dimension of 1,024.

As mentioned earlier, the MUV dataset is very diverse. Another widely used indicator of diversity among the molecules in a database is the mean pairwise similarity (MPS) score. The lower the score, the more heterogeneous an activity class is, and hence the more difficult its actives are to retrieve in a virtual screening task. The MPS of each activity class, computed over every pair of compounds in the class with different fingerprints using the Jaccard/Tanimoto similarity coefficient, is shown in Table 2. The MPS is only 0.19 on average, on a scale from 0 to 1.
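As an illustration, the MPS of one activity class can be computed as the mean Jaccard/Tanimoto similarity over all unordered pairs of its compounds. The sketch below assumes this pairwise definition; the function names and the three toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto(x, y):
    a = sum(1 for p, q in zip(x, y) if p and q)
    union = sum(1 for p, q in zip(x, y) if p or q)
    return a / union if union else 0.0


def mean_pairwise_similarity(fingerprints):
    """Average similarity over every unordered pair of compounds."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    return sum(tanimoto(x, y) for x, y in pairs) / len(pairs)


actives = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
mps = mean_pairwise_similarity(actives)
print(round(mps, 4))
```

Here the three pairs have similarities 1/3, 0, and 1/3, so the class MPS is 2/9, which would indicate a fairly heterogeneous set.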

Experiment settings

The dataset is divided into training and test sets. The training sets are created similarly to [30-32]. All 30 active molecules from the 17 activity classes in the MUV dataset are collected in a data pool. Then, for each activity class under consideration, we randomly select 170 molecules (nTr) as a training set consisting of 10 active and 160 inactive molecules. The remaining samples in the data pool, combined with the inactive samples for the activity class under consideration, constitute the test set. Active and inactive molecules are labelled as 1 and −1, respectively.

The experiments are divided into three parts as follows:

  • Evaluating WS-ELM in conjunction with different types of available similarity coefficients against the baseline method–similarity searching–on four considered fingerprints in order to obtain the best similarity coefficient suitable for WS-ELM and fingerprint.
  • Comparing our proposed algorithm, CWS-ELM, with ELM variants.
  • Comparing the proposed algorithm with other approaches, i.e. Similarity Searching, SVM, and Random Forests (RF).

All experiments are run 10 times with different random splits on training and test sets.

In addition, the hyper-parameters of each algorithm are selected by estimating the generalisation error via five-fold cross-validation on the training set, on the basis of the area under the Receiver Operating Characteristic curve (AUROC). There are many evaluation criteria for virtual screening tasks, e.g. AUROC, Enrichment Factor (EF), Robust Initial Enhancement (RIE), and Boltzmann-Enhanced Discrimination of ROC [33-36]; we select AUROC because it is simple and a standard metric in many fields.
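For reference, AUROC can be computed directly from its ranking interpretation: the probability that a randomly chosen active scores higher than a randomly chosen inactive, with ties counted as one half. The following sketch uses this pairwise form; the function name and toy scores are illustrative, and a production implementation would normally use a sort-based method instead of comparing all pairs.

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t != 1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, -1, 1, -1, -1, -1]
print(auroc(scores, labels))
```

In this toy example the two actives outrank 4 and 3 of the four inactives respectively, so 7 of the 8 active/inactive pairs are ordered correctly.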

In WS-ELM, there are two parameters which need to be tuned: the number of hidden nodes (l) and a regularisation parameter (C). The range of l was {1, …, nTr} while the range of C was {10^−6, 10^−5, …, 10^5, 10^6}. For CWS-ELM, the regularisation parameter (C) is tuned over the same range as in WS-ELM. In addition to this base hyper-parameter, the number of clusters (k) for k-means-based WS-ELM (CWS-ELMKMC) is tuned within a range of 1 to nTr, while SVC-based WS-ELM (CWS-ELMSVC) has another regularisation parameter CS, ranging from 0.1 to 1.0 in increments of 0.1.

A model is trained with the training data with a set of optimal parameters. The model is tested on the test set and evaluated with a widely used performance measure in a virtual screening task–the average proportion of the maximum possible number of active molecules (hit rate) which is retrieved from the top 1% of the ranked database. The molecules are ranked based on the predicted score from the output layer of WS-ELM and its variants. The higher the score, the more likely the molecule is to be active.
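The hit rate in the top 1% can be sketched as below. The sketch reads "maximum possible number of active molecules" as the smaller of the number of actives in the test set and the number of molecules in the top 1%; this reading, the integer rounding of the cutoff, and the function name are assumptions.

```python
def hit_rate_top_percent(scores, labels, top_percent=1):
    """Percentage of the maximum possible actives found in the top slice."""
    n_top = max(1, len(scores) * top_percent // 100)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = sum(1 for _, t in ranked[:n_top] if t == 1)
    n_actives = sum(1 for t in labels if t == 1)
    max_possible = min(n_actives, n_top)
    if max_possible == 0:
        raise ValueError("no actives in the ranked database")
    return 100.0 * hits / max_possible


# 200 molecules: 4 actives followed by 196 inactives.
labels = [1] * 4 + [-1] * 196
scores = [1.0, 0.5, 0.4, 0.3] + [0.9] + [0.1] * 195
print(hit_rate_top_percent(scores, labels))
```

With 200 molecules the top 1% holds two molecules, of which one is active, giving a hit rate of 50%. In the MUV setting the top 1% of a test set is larger than the number of test actives, so the denominator is simply the number of actives.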

All experiments are carried out using the Matlab environment. SVC toolbox is available to download at https://sites.google.com/site/daewonlee/research/svctoolbox and the proposed CWS-ELM can be downloaded at https://github.com/dsmlr/cwselm.

Results and discussions

A comparison of similarity searching and WS-ELM with the 16 similarity coefficients on four types of fingerprint

WS-ELM together with the 16 coefficients and similarity searching were evaluated on the 17 activity classes with four types of fingerprint. The experiment results are shown in Tables 3 and 4 for similarity searching and WS-ELM, respectively. Each element in these tables contains the mean hit rate, when averaged across the four fingerprints and 10 different data splits, in the top 1% of the ranked database. It is clear that Sokal/Sneath(1) could achieve the best performance followed by Jaccard/Tanimoto and Sokal/Sneath(3) coefficients in both similarity searching and WS-ELM techniques. It should be noted that Sokal/Sneath(1) is a modified version of the Jaccard/Tanimoto function which gives double weight to non-matches. The worst similarity coefficients in similarity searching and WS-ELM are Roger/Tanimoto and Yule, respectively.

Table 3. Maximum percentage actives retrieved in top 1% of ranked database using similarity searching technique (average across 10 runs).

Bold face is the best result in each activity class.

https://doi.org/10.1371/journal.pone.0195478.t003

Table 4. Maximum percentage actives retrieved in top 1% of ranked database using WS-ELM technique (average across 10 runs).

Bold face is the best result in each activity class.

https://doi.org/10.1371/journal.pone.0195478.t004

There is a degree of variation in the performance of the 16 similarity coefficients (N objects) by each of the 17 activity classes (k judges). The ranks in Tables 5 and 6 are assigned according to Tables 3 and 4, respectively. The degree of agreement between the rankings assigned can be determined by a statistical analysis called the “Kendall Coefficient of Concordance” [37]. This can be calculated by Eq (14).

$$W = \frac{12 \sum_{i=1}^{N} \bar{R}_i^2 - 3N(N+1)^2}{N(N^2-1) - \frac{1}{k}\sum_{j=1}^{k} T_j} \qquad (14)$$

where $\bar{R}_i$ is the average of the ranks assigned to the i-th object. $T_j$ is a correction factor for tied ranks calculated by Eq (15).

$$T_j = \sum_{i=1}^{g_j} \left(t_i^3 - t_i\right) \qquad (15)$$

where $t_i$ is the number of tied ranks in the i-th grouping of ties, and $g_j$ is the number of groups of ties in the j-th set of ranks. The significance of the computed value of W can be obtained from the table of critical values for N ≤ 7 [37] or from a table of the chi-square distribution with N − 1 degrees of freedom for N > 7. We can calculate chi-square from

$$\chi^2 = k(N-1)W \qquad (16)$$

The computed values of W for similarity searching and WS-ELM are 0.8103 and 0.5161, respectively, which correspond to χ2 values of 206.64 and 131.61, respectively (p < 0.001 for 15 degrees of freedom). Because agreement between various rankings of the same set of activity classes is significant, this leads to the following orderings in the similarity searching case: The rank of the 16 coefficients in WS-ELM case is as follows:

Table 5. Ranks assigned to 16 similarity coefficients–similarity searching–by 17 activity classes from Table 3.

https://doi.org/10.1371/journal.pone.0195478.t005

Table 6. Ranks assigned to 16 similarity coefficients–ELM–by 17 activity classes from Table 4.

https://doi.org/10.1371/journal.pone.0195478.t006
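The concordance analysis on rank tables such as Tables 5 and 6 can be sketched in Python. The sketch assumes the standard tie-corrected form of Kendall's W, with W = [12 Σ R̄i² − 3N(N+1)²] / [N(N²−1) − (Σ Tj)/k] and χ² = k(N−1)W; the function names and the toy rank matrices are hypothetical.

```python
def kendall_w(ranks):
    """ranks[j][i] is the rank judge j gives object i (mean ranks for ties)."""
    k, n_obj = len(ranks), len(ranks[0])
    mean_ranks = [sum(r[i] for r in ranks) / k for i in range(n_obj)]
    tie_total = 0.0
    for r in ranks:
        counts = {}
        for v in r:
            counts[v] = counts.get(v, 0) + 1
        # Untied ranks have t = 1 and contribute nothing.
        tie_total += sum(t ** 3 - t for t in counts.values())
    num = 12 * sum(m * m for m in mean_ranks) - 3 * n_obj * (n_obj + 1) ** 2
    den = n_obj * (n_obj ** 2 - 1) - tie_total / k
    return num / den


def chi_square_from_w(w, k, n_obj):
    """Chi-square statistic with n_obj - 1 degrees of freedom."""
    return k * (n_obj - 1) * w


print(kendall_w([[1, 2, 3, 4]] * 3))          # judges agree completely
print(kendall_w([[1, 2, 3], [3, 2, 1]]))      # judges disagree completely
print(chi_square_from_w(0.8103, k=17, n_obj=16))
```

Plugging the reported W values into the chi-square relation with k = 17 judges and N = 16 objects gives 17 × 15 × 0.8103 ≈ 206.63 and 17 × 15 × 0.5161 ≈ 131.61, which agrees with the reported χ2 values up to rounding of W.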

WS-ELM is then compared with the similarity searching technique. According to Tables 3 and 4, WS-ELM retrieves a higher maximum percentage of actives (9.10%) than similarity searching (7.19%) on average across 17 activity classes, 16 similarity coefficients, four fingerprints, and ten runs. The t-test is used to test the significance of the difference between the means of two independent samples [37]. It confirms that WS-ELM performs better than similarity searching on average at p < 0.001.

Next, the performance of Sokal/Sneath(1) in similarity searching and WS-ELM is analysed. As shown in Tables 3 and 4, similarity searching and WS-ELM achieve 11.79% and 12.16% of maximum percentage actives retrieved, respectively. However, this difference is not statistically significant (p = 0.5759), so it is inconclusive whether WS-ELM with Sokal/Sneath(1) outperforms similarity searching with Sokal/Sneath(1).

Fig 2 shows the relative improvement or worsening of WS-ELM with respect to similarity searching, averaged across 16 similarity coefficients, four fingerprints, and 10 runs. The entries are sorted by MPS score. It is hardly surprising that WS-ELM performs better than similarity searching, because similarity searching only uses the active molecules in its training set while WS-ELM has a proper training set consisting of active and inactive molecules. WS-ELM was more effective than similarity searching in 16 out of 17 cases, especially in the cases with low MPS (heterogeneous). This means that including inactive molecules in the training sets can improve overall performance. However, it might not be very useful in some homogeneous classes, i.e. I01, I02, I05, and I07.

Fig 2. Relative improvement/worsening with respect to similarity searching for top 1% retrieved–average across ten runs, 16 similarity coefficients, and four fingerprints.

https://doi.org/10.1371/journal.pone.0195478.g002

Further analysis is conducted using a violin plot (Fig 3) to evaluate the distribution of the results for WS-ELM in conjunction with each similarity coefficient. Two distinct distributions can clearly be seen for each coefficient. They reflect the activity classes with high and low MPS scores. The distribution with the higher hit rate contains activity classes I01, I02, I03, I04, and I05, with an average MPS of 0.21, while the distribution with the lower hit rate contains the remaining activity classes, with an average MPS of 0.18. In other words, the five most homogeneous activity classes in the MUV dataset achieve higher hit rates than the others. Conversely, if the active molecules are structurally very heterogeneous, it is difficult to achieve a high hit rate in that activity class, as shown in Fig 4.

Fig 3. Violin plot of maximum percentage of active molecules retrieved in the top 1% with WS-ELM in conjunction with 16 different similarity coefficients–averaged across ten runs, 17 activity classes, and four fingerprints.

https://doi.org/10.1371/journal.pone.0195478.g003

Fig 4. Maximum percentage of active molecules retrieved in the top 1% with WS-ELM and similarity searching in 17 activity classes–averaged across ten runs, 16 similarity coefficients, and four fingerprints.

https://doi.org/10.1371/journal.pone.0195478.g004

Fig 5 shows maximum percentage of actives retrieved with WS-ELM and similarity searching using different fingerprints–averaged across all activity classes and all similarity coefficients. Representing molecules with ECFP_6 fingerprint enables retrieving the most actives on average in both WS-ELM and similarity searching. FCFP_4 fingerprint performs worst on average. Again, Kendall Coefficient of Concordance is applied in order to obtain ordering in four fingerprints–4 objects–and 17 judges. The computed W values are 0.1657 and 0.1352 for WS-ELM and similarity searching, respectively. According to these values, the chi-square values yield 84.5153 and 68.9718, respectively; both are significant at the 0.001 level of statistical significance. This suggests the same orderings in fingerprint case for both WS-ELM and similarity searching:

Fig 5. Maximum percentage of active molecules retrieved with WS-ELM and similarity searching using four different fingerprints–averaged across 17 activity classes, 16 similarity coefficients, and 10 runs.

https://doi.org/10.1371/journal.pone.0195478.g005

A comparison of CWS-ELM and WS-ELM with the best two similarity coefficients

The two best similarity coefficients–Sokal/Sneath(1) and Jaccard/Tanimoto–for MUV dataset from the first part are employed in the proposed CWS-ELM algorithm. The proposed algorithms are compared with WS-ELM on the same framework. Maximum percentage of active molecules retrieved in the top 1% and number of hidden nodes used in the model are reported in Table 7. Each element is an average across four fingerprints and 10 runs. The proposed algorithm is reported as CWS-ELMKMC and CWS-ELMSVC for CWS-ELM in conjunction with k-means clustering and SVC, respectively.

Table 7. The percentage hit rate in the top 1% of the ranked database retrieved by WS-ELM and CWS-ELM in conjunction with Jaccard/Tanimoto (JT) and Sokal/Sneath(1) (SN1).

Figures in bold face represent the best performance.

https://doi.org/10.1371/journal.pone.0195478.t007

Overall, the proposed CWS-ELM yields the highest performance in 15/17 activity classes. The best technique is CWS-ELMSVC in conjunction with Sokal/Sneath(1), which achieves the best percentage of active molecules retrieved in 9/17 cases, 13.02% on average across all activity classes, but it requires the highest number of nodes in the hidden layer at 71.0%. It is followed by CWS-ELMSVC in conjunction with Jaccard/Tanimoto, which achieved a 12.03% hit rate and exhibited high accuracy in 4/17 cases. However, the performance of CWS-ELMKMC is slightly worse than that of WS-ELM because it uses a smaller number of hidden nodes on average than WS-ELM. The correlation coefficient between the mean hit rate and the number of nodes used in the model is 0.93, which indicates a very high correlation. Because the dataset is highly diverse, the number of nodes in the hidden layer directly affects the performance of the model. If the model is too simple, the performance of the classifier degrades.

Our proposed algorithm embeds two clustering techniques, which differ in nature, to select the representative samples in WS-ELM: one selects the centroids of the clusters and the other uses the support vectors bounding the clusters. For the same number of clusters in the space, SVC requires more than one support vector to bound and identify a cluster, while k-means clustering needs only one centroid to represent it. Therefore, SVC has a high chance of performing better than k-means clustering on this dataset, because the data are very diverse.

Again, we applied the Kendall Coefficient of Concordance to test the significance on the ranking of six contenders in Table 8. The computed W is 0.2581–leading to a χ2 of 21.94–which indicates that the results are highly statistically significant. This gives the following ranking: It is clear that CWS-ELMSVC is the best contender among all while the worst is WS-ELMJT.

Table 8. Ranks assigned to the performances of 6 classifiers by 17 activity classes from Table 7.

https://doi.org/10.1371/journal.pone.0195478.t008

Furthermore, the effect of the number of nodes in the hidden layer on performance is investigated. The most homogeneous (I01) and the most diverse (I17) classes in the dataset are evaluated with the ECFP_6 fingerprint. The regularisation parameter for each model, with a different number of nodes used, is tuned by five-fold cross-validation on the basis of AUROC. Again, the experiment is conducted ten times with different random splits. In this experiment, only WS-ELM and CWS-ELMKMC are evaluated because the number of hidden nodes in these two can be directly adjusted and compared; in CWS-ELMSVC, by contrast, the number of nodes depends on CS. The results for I01 and I17 are displayed in Figs 6 and 7, respectively.

Fig 6. Effect on AUROC when the number of hidden nodes in WS-ELM and CWS-ELMKMC is changed in activity class I01.

Solid lines represent mean values while shaded areas represent error/confidence bounds. The upper and lower bounds of each node are based on the standard deviation.

https://doi.org/10.1371/journal.pone.0195478.g006

Fig 7. Effect on AUROC of changing the number of hidden nodes in WS-ELM and CWS-ELMKMC for activity class I17.

Solid lines represent mean values while shaded areas represent error/confidence bounds. The upper and lower bounds of each node are based on the standard deviation.

https://doi.org/10.1371/journal.pone.0195478.g007

It is clear from Fig 7 that, for I17, CWS-ELMKMC is better than WS-ELM in the number of actives retrieved when a small number of nodes (1–28%) is used in the model. Moreover, it is more robust than WS-ELM, as shown by the smaller standard deviations in its performance. This means that carefully selecting the samples used as hidden nodes is important. According to Fig 6, CWS-ELMKMC is comparable to WS-ELM on I01. Comparing the two activity classes, AUROC converges at 15% of the number of nodes used for I01, whereas for I17 it converges at 30%. This shows that the classifiers converge more quickly on I01 than on I17.

We also show enrichment plots, which are a very useful tool for evaluating the quality of virtual screening methods. An enrichment plot is a cumulative sum plot of the active molecules retrieved, here from the top 1% of the ranked database. Figs 8 and 9 show enrichment plots for I01 and I17, respectively. Clearly, CWS-ELM performs better than the conventional WS-ELMJT on both I01 and I17. All methods perform better on I01, the most homogeneous activity class, than on I17, the most heterogeneous activity class.
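The cumulative sum behind an enrichment plot can be sketched as follows. This is a minimal illustration, assuming each molecule has a classifier score (higher means more likely active) and a binary activity label; the function name and the toy data are hypothetical and not taken from the study.

```python
def enrichment_curve(scores, labels, top_fraction=0.01):
    """Cumulative number of actives found while walking down the ranked list.

    scores: classifier outputs, higher = more likely active.
    labels: 1 for active, 0 for inactive.
    Returns one cumulative count per position in the top fraction.
    """
    # Rank molecules from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(len(scores) * top_fraction))
    curve, found = [], 0
    for i in order[:n_top]:
        found += labels[i]
        curve.append(found)
    return curve


# Toy example with six molecules, inspecting the top half of the ranking.
toy_scores = [0.9, 0.2, 0.8, 0.4, 0.6, 0.1]
toy_labels = [1, 0, 0, 1, 1, 0]
print(enrichment_curve(toy_scores, toy_labels, top_fraction=0.5))  # [1, 1, 2]
```

Plotting the returned list against position in the ranked list gives the kind of curve described above; a steeper early rise means actives are concentrated nearer the top.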

Fig 8. Enrichment plot for the top 1% of the sorted library for each performer with ECFP_6 fingerprint on activity class I01.

https://doi.org/10.1371/journal.pone.0195478.g008

Fig 9. Enrichment plot for the top 1% of the sorted library for each performer with ECFP_6 fingerprint on activity class I17.

https://doi.org/10.1371/journal.pone.0195478.g009

In addition to comparing overall performance using enrichment plots, the individual molecules retrieved are shown in Figs 10 and 11 for activity classes I01 and I17, respectively. It can be seen that CWS-ELMSVC-SN is the best on I01. Essentially, any molecule retrieved by the other approaches is also retrieved in the top 1% of the list by CWS-ELMSVC-SN, but in a different order. This is because Sokal/Sneath(1) is a modified version of the Jaccard/Tanimoto function, as mentioned in the previous section. In the I17 case, WS-ELMSN fails to retrieve any active molecules in the top 1%, while the other methods retrieve one or two active molecules.
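The relationship between the two coefficients can be made concrete with a small sketch. The formulas below are the commonly used binary-fingerprint definitions and are an assumption here, since the section that defines them is not part of this text: with a and b the numbers of bits set in each fingerprint and c the number of bits in common, Jaccard/Tanimoto is c/(a + b − c) and Sokal/Sneath(1) is c/(2a + 2b − 3c). The function names are hypothetical.

```python
def bit_counts(x, y):
    """Return (a, b, c): bits set in x, bits set in y, bits set in both."""
    a = sum(x)
    b = sum(y)
    c = sum(xi & yi for xi, yi in zip(x, y))
    return a, b, c


def jaccard_tanimoto(x, y):
    a, b, c = bit_counts(x, y)
    denom = a + b - c
    return c / denom if denom else 0.0


def sokal_sneath_1(x, y):
    a, b, c = bit_counts(x, y)
    denom = 2 * a + 2 * b - 3 * c
    return c / denom if denom else 0.0


x = [1, 1, 0, 1, 0]
y = [1, 0, 1, 1, 0]
t = jaccard_tanimoto(x, y)
s = sokal_sneath_1(x, y)
# Under these definitions, Sokal/Sneath(1) equals T / (2 - T).
print(t, abs(s - t / (2 - t)) < 1e-12)  # 0.5 True
```

Under these definitions Sokal/Sneath(1) weights mismatched bits twice as heavily as Jaccard/Tanimoto, so the two produce related but differently scaled hidden-node activations, which is consistent with the same molecules being retrieved in a different order.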

Fig 10. Molecules retrieved by different methods in top 1% of the ranked database for activity class I01.

https://doi.org/10.1371/journal.pone.0195478.g010

Fig 11. Molecules retrieved by different methods in top 1% of the ranked database for activity class I17.

https://doi.org/10.1371/journal.pone.0195478.g011

A comparison of CWS-ELM and WS-ELM with the best similarity coefficient against other approaches

The proposed methods, CWS-ELM and WS-ELM, are compared against other approaches, namely SVM, RF, and Similarity Searching. Apart from RF, all methods are based on the Sokal/Sneath(1) coefficient. The hyper-parameters of SVM and RF are tuned with the same framework as the proposed methods. As mentioned earlier, there are many criteria for evaluating the algorithms; in the previous experiments, AUROC was chosen for its simplicity, together with the percentage hit rate in the top 1%, which gives the same picture as EF. However, AUROC has been criticised because it is a global measure that does not pay attention to the top-ranked molecules. Truchon & Bayly therefore proposed a generalised ROC metric called “Boltzmann-Enhanced Discrimination of ROC” (BEDROC), which addresses the early recognition problem [34].

Recommendations on how best to evaluate virtual screening methods have been made in [35, 38], and EF gives very much the same results as BEDROC but is easier to understand [36]. We therefore follow the evaluations suggested in [35] by reporting the following measures: (i) EF at 0.5%, 1.0%, 2.0%, and 5.0%, and (ii) the ratio of true positive to false positive rates at 0.5%, 1.0%, 2.0%, and 5.0%. Fig 12 shows both measures at the top 0.5%, 1.0%, 2.0%, and 5.0% of the ranked database. Both criteria display the same overall picture. CWS-ELMSVC is still the best contender among all algorithms, followed by SVM, at EF0.5% and EF1.0%. The worst is the similarity search technique, as expected. These observations are confirmed by the Kendall Coefficient of Concordance (with N = 6 and k = 17): the W values are 0.2186 (p < 0.01) and 0.1576 (p < 0.05) for EF at 0.5% and 1.0%, respectively, and lead to the following rankings.

Unfortunately, the values of W are not significant at the p = 0.05 level for EF at 2.0% and 5.0%.
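For readers unfamiliar with the two early recognition measures, the sketch below shows one plausible way to compute them at a given cut-off of the ranked database. It assumes EF is the active rate in the top fraction divided by the active rate in the whole database, and that the true and false positive rates are both taken at that same cut-off; the function names and toy data are hypothetical, and the exact definitions used for Fig 12 may differ in detail.

```python
def top_counts(scores, labels, top_fraction):
    """Return (actives in the top fraction, size of the top fraction)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(len(scores) * top_fraction))
    actives_top = sum(labels[i] for i in order[:n_top])
    return actives_top, n_top


def enrichment_factor(scores, labels, top_fraction):
    """Active rate in the top fraction relative to the whole database."""
    actives_top, n_top = top_counts(scores, labels, top_fraction)
    return (actives_top / n_top) / (sum(labels) / len(labels))


def tpr_fpr_ratio(scores, labels, top_fraction):
    """Ratio of true positive rate to false positive rate at the cut-off."""
    actives_top, n_top = top_counts(scores, labels, top_fraction)
    n_actives = sum(labels)
    n_inactives = len(labels) - n_actives
    tpr = actives_top / n_actives
    fpr = (n_top - actives_top) / n_inactives
    # No inactive retrieved at the cut-off: the ratio is unbounded.
    return float("inf") if fpr == 0 else tpr / fpr


# Toy database: 20 molecules already sorted by score, 4 of them active.
toy_scores = [20 - i for i in range(20)]
toy_labels = [1, 1, 0, 1] + [0] * 6 + [1] + [0] * 9
print(enrichment_factor(toy_scores, toy_labels, 0.2))
print(tpr_fpr_ratio(toy_scores, toy_labels, 0.2))
```

In this toy case the top 20% holds three of the four actives and one of the sixteen inactives, so both measures reward the same early retrieval, which is in line with the two panels of Fig 12 showing the same overall picture.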

Fig 12. Early recognition criteria suggested by [35, 38].

(Left) EF (Right) Ratio of true positive rate to the false positive rate, at 0.5%, 1.0%, 2.0%, and 5.0% of the ranked database for WS-ELM and its variants, SVM, RF, and Similarity Searching (SS). Each bar represents the mean value across all activity classes and ten runs.

https://doi.org/10.1371/journal.pone.0195478.g012

Furthermore, we also evaluate the task with BEDROC, whose parameter α controls how many of the top-ranked molecules in the database are considered: the higher the value of α, the smaller the number of molecules considered. As we are interested in the top 1% of the ranked database, α is set to 160.9 (refer to [34]). The BEDROC results are shown in Fig 13 together with EF1.0%. Again, testing the results with the Kendall Coefficient of Concordance (W = 0.1585) gives a ranking that is significant at p = 0.05. Moreover, Fig 13 also shows that EF1.0% correlates with BEDROC(160.9), with a correlation coefficient of 0.9917. Although EF and BEDROC are strongly correlated, EF does not take into account the ratio of active to inactive molecules, whereas BEDROC does.
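The choice α = 160.9 and the BEDROC score itself can be sketched as below. Two assumptions are made that are not stated in this text: first, that α is chosen so that 80% of the exponential weight falls within the chosen top fraction, which gives α = ln(5)/0.01 ≈ 160.9 for the top 1%; second, that BEDROC is computed as the robust initial enhancement (RIE) rescaled between its minimum and maximum attainable values. The function names are hypothetical, and this is an illustrative sketch rather than the implementation used in the study.

```python
import math


def alpha_for_top_fraction(fraction, weight=0.8):
    """Alpha such that `weight` of the exponential weight lies in the top fraction.

    Assumption: solves 1 - exp(-alpha * fraction) = weight.
    """
    return -math.log(1 - weight) / fraction


def bedroc(scores, labels, alpha=160.9):
    """BEDROC as min-max rescaled RIE; requires both actives and inactives."""
    n_mol = len(scores)
    n_act = sum(labels)
    if n_act == 0 or n_act == n_mol:
        raise ValueError("need at least one active and one inactive")
    order = sorted(range(n_mol), key=lambda i: scores[i], reverse=True)
    # Exponentially weighted sum over the ranks (1-based) of the actives.
    weighted = sum(
        math.exp(-alpha * rank / n_mol)
        for rank, i in enumerate(order, start=1)
        if labels[i]
    )
    # Expected weighted sum per active under a random ranking.
    random_term = (1 / n_mol) * (1 - math.exp(-alpha)) / (math.exp(alpha / n_mol) - 1)
    rie = weighted / (n_act * random_term)
    ratio = n_act / n_mol
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


print(round(alpha_for_top_fraction(0.01), 1))  # 160.9
```

Because RIE is rescaled by bounds that depend on the ratio of actives to all molecules, the score accounts for that ratio, which is the property the text contrasts with EF.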

Fig 13. Bar charts showing mean EF and BEDROC at 1.0% of the ranked database for WS-ELM and its variants, SVM, RF, and Similarity Searching (SS).

According to Truchon & Bayly, the top 1% of the ranked database is equivalent to α = 160.9 of BEDROC [34].

https://doi.org/10.1371/journal.pone.0195478.g013

Conclusion

This study proposes a modified ELM, termed WS-ELM, which improves the overall performance of virtual screening tasks, and demonstrates its capability on the MUV dataset, which is known as one of the most challenging datasets. The results show that Sokal/Sneath(1) and Jaccard/Tanimoto are the two best performers in this task among 16 similarity coefficients. Moreover, statistical analysis shows that the ECFP fingerprint is better than the FCFP fingerprint, and that a circular substructure with a diameter of six bonds is generally better than one with a diameter of four bonds. Because the weights in the hidden nodes are generated randomly, the stability and robustness of WS-ELM cannot be guaranteed, which can lead to inaccurate predictions. Thus, WS-ELM is extended to CWS-ELM, which adopts a clustering algorithm, namely k-means clustering or SVC, to select the weights in the hidden nodes carefully instead of randomly. Experimental results confirm that CWS-ELM performs better and is more robust than WS-ELM. CWS-ELMSVC-SN is the best approach and is consistently listed in the top ranks compared with its variants and other machine learning techniques.

Acknowledgments

The authors would like to thank Prof. Peter Willett of Information School, University of Sheffield for his valuable comments and suggestions to improve the quality of the paper.

References

  1. Leach AR, Gillet VJ. An Introduction to Chemoinformatics. Springer Netherlands; 2007.
  2. Wilton DJ, Harrison RF, Willett P, Delaney J, Lawson K, Mullier G. Virtual Screening Using Binary Kernel Discrimination: Analysis of Pesticide Data. Journal of Chemical Information and Modeling. 2006;46(2):471–477. pmid:16562974
  3. Pasupa K, Hussain Z, Shawe-Taylor J, Willett P. Drug Screening with Elastic-Net Multiple Kernel Learning. In: Proceeding of the 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2013), 10–13 November 2013, Chania, Greece; 2013. p. 1–5.
  4. Kurczab R, Bojarski AJ. The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening. PloS One. 2017;12(4):e0175410. pmid:28384344
  5. Chen B, Harrison RF, Pasupa K, Willett P, Wilton DJ, Wood DJ, et al. Virtual Screening Using Binary Kernel Discrimination: Effect of Noisy Training Data and the Optimization of Performance. Journal of Chemical Information and Modeling. 2006;46(2):478–486. pmid:16562975
  6. Myint KZ, Wang L, Tong Q, Xie XQ. Molecular Fingerprint-Based Artificial Neural Networks QSAR for Ligand Biological Activity Predictions. Molecular Pharmaceutics. 2012;9(10):2912–2923. pmid:22937990
  7. Han L, Wang Y, Bryant SH. Developing and validating predictive decision tree models from mining chemical structural fingerprints and high-throughput screening data in PubChem. BMC Bioinformatics. 2008;9(1):401. pmid:18817552
  8. Muegge I, Oloff S. Advances in virtual screening. Drug Discovery Today: Technologies. 2006;3(4):405–411.
  9. Pasupa K. The Review of Virtual Screening Techniques. KMITL Journal of Information Technology. 2012;1(1):60–82.
  10. Lima AN, Philot EA, Trossini GHG, Scott LPB, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opinion on Drug Discovery. 2016;11(3):225–239. pmid:26814169
  11. Huang GB, Zhu QY, Siew CK. Extreme learning machine: a new learning scheme of feedforward neural networks. In: Proceeding of IEEE International Joint Conference on Neural Networks (IJCNN’2004); 2004. p. 985–990.
  12. Huang GB, Chen L, Siew CK. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transaction on Neural Networks. 2006;17(4):879–892.
  13. Chorowski J, Wang J, Zurada JM. Review and performance comparison of SVM-and ELM-based classifiers. Neurocomputing. 2014;128:507–516.
  14. Lan Y, Soh YC, Huang GB. Extreme learning machine based bacterial protein subcellular localization prediction. In: Proceeding of IEEE International Joint Conference on Neural Networks (IJCNN’2008); 2008. p. 1859–1863.
  15. Wang G, Zhao Y, Wang D. A Protein Secondary Structure Prediction Framework Based on the Extreme Learning Machine. Neurocomputing;.
  16. Cao J, Xiong L. Protein sequence classification with improved extreme learning machine algorithms. BioMed Research International. 2014;2014:1–12.
  17. Czarnecki WM. Weighted Tanimoto Extreme Learning Machine with Case Study in Drug Discovery. IEEE Computational Intelligence Magazine. 2015;10(3):19–29.
  18. Ellis D, Furner-Hines J, Willett P. Measuring the degree of similarity between objects in text retrieval systems. Perspectives in Information Management. 1993;3(2):128–149.
  19. Patra BK, Launonen R, Ollikainen V, Nandi S. A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowledge-Based Systems. 2015;82:163–177.
  20. Holliday JD, Hu CY, Willett P. Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-Strings. Combinatorial Chemistry & High Throughput Screening. 2002;5(2):155–166.
  21. Pasupa K. Data Mining and Decision Support in Pharmaceutical Databases. Department of Automatic Control & Systems Engineering, University of Sheffield; 2007.
  22. Jain AK, Murty MN, Flynn PJ. Data Clustering: A Review. ACM Computing Survey. 1999;31(3):264–323.
  23. Xu D, Tian Y. A Comprehensive Survey of Clustering Algorithms. Annals of Data Science. 2015;2(2):165–193.
  24. Ben-Hur A, Horn D, Siegelmann HT, Vapnik V. Support vector clustering. Journal of machine learning research. 2001;2:125–137.
  25. Rohrer SG, Baumann K. Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data. Journal of Chemical Information and Modeling. 2009;49(2):169–184. pmid:19434821
  26. Accelrys Inc. Pipeline Pilot Software; 2017. Available from: http://www.accelrys.com.
  27. Digital Chemistry Ltd. BCI fingerprints; 2017. Available from: http://www.digitalchemistry.co.uk.
  28. Daylight Chemical Information Systems, Inc. Daylight fingerprints; 2017. Available from: http://www.daylight.com.
  29. Certara, LP. Unity fingerprints; 2017. Available from: https://www.certara.com/.
  30. Gardiner EJ, Holliday JD, O’Dowd C, Willett P. Effectiveness of 2D fingerprints for scaffold hopping. Future Medicinal Chemistry. 2011;3(4):405–414. pmid:21452977
  31. Kudisthalert W, Pasupa K. A Coefficient Comparison of Weighted Similarity Extreme Learning Machine for Drug Screening. In: Proceeding of the 8th International Conference on Knowledge and Smart Technology (KST 2016), 3–6 February 2016, Chiang Mai, Thailand; 2016. p. 43–48.
  32. Kudisthalert W, Pasupa K. Clustering-based Weighted Extreme Learning Machine for Classification in Drug Discovery Process. In: Hirose A, Ozawa S, Doya K, Ikeda K, Lee M, Liu D, editors. Proceeding of the 23rd International Conference on Neural Information Processing (ICONIP 2016), 16–21 Oct 2016, Kyoto, Japan. vol. 9948 of Lecture Notes in Computer Science; 2016. p. 441–450.
  33. Edgar SJ, Holliday JD, Willett P. Effectiveness of retrieval in similarity searches of chemical databases: a review of performance measures. Journal of Molecular Graphics and Modelling. 2000;18(4–5):343–357. pmid:11143554
  34. Truchon JF, Bayly CI. Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. Journal of chemical information and modeling. 2007;47(2):488–508. pmid:17288412
  35. Jain AN, Nicholls A. Recommendations for evaluation of computational methods. Journal of computer-aided molecular design. 2008;22(3–4):133–139. pmid:18338228
  36. Riniker S, Landrum GA. Open-source platform to benchmark fingerprints for ligand-based virtual screening. Journal of Cheminformatics. 2013;5(1):26. pmid:23721588
  37. Siegel S, Castellan NJ. Nonparametric statistics for the behavioral sciences. 2nd ed. McGraw-Hill; 1988.
  38. Nicholls A. What do we know and when do we know it? Journal of computer-aided molecular design. 2008;22(3–4):239–255. pmid:18253702