Abstract

Protein-protein interactions (PPIs) play a crucial role in cellular processes. In the present work, a new approach is proposed to construct a PPI predictor: a support vector machine model is trained through a mutual-information-based filter-wrapper parallel feature selection algorithm, with a relevant negative training set selected by an iterative, hierarchical clustering procedure. Using a selected suboptimum set of features, the constructed support vector machine model is able to classify PPIs with high accuracy in any positive and negative dataset.

1. Introduction

Protein-protein interactions (PPIs) play a critically important role in almost every biological function carried out within the cell [1, 2]. In fact, an enormous effort has already been made to study biological protein networks in order to understand the main cell mechanisms [3-5]. The development of new technologies has improved the experimental techniques for detecting PPIs, such as coimmunoprecipitation (CoIP), yeast two-hybrid (Y2H), or mass spectrometry studies [6-9]. However, because of the cost and time requirements associated with the experimental techniques, computational approaches have been developed for predicting PPIs [5].

Consequently, different computational methods have been applied to PPI prediction, among them Bayesian approaches [10-12], maximum likelihood estimation (MLE) [13, 14], maximum specificity set cover (MSSC) [4], decision trees [15, 16], and support vector machines (SVM) [15-18]. Many computational approaches use information from diverse sources at different levels [5]. In this way, PPI prediction models [4, 13, 15, 16, 19] have been built using domain information. Since interacting proteins are usually coexpressed and colocated in the same subcellular compartment [10], cell location patterns are also considered a valid criterion in prediction works. In other works, authors use functional similarity to predict interacting proteins [20]. Likewise, the concept of homology has already been used to generate prediction models [19, 21], databases of homologous interactions [11], and negative datasets [22].

In the past years, these experimental methods [23] and computational approaches [22] have provided interactions for several organisms such as Saccharomyces cerevisiae (S. cerevisiae, Baker's yeast, or simply yeast) [24-27], Caenorhabditis elegans (C. elegans) [28, 29], Drosophila melanogaster (D. melanogaster or fruit fly) [30, 31], and Homo sapiens (H. sapiens) [3, 6, 32].

In spite of the huge amount of interaction data obtained through high-throughput technologies, it is still difficult to compare these data, as they contain a large number of false positives [11, 22]. Some authors provide several reliable interaction sets with diverse confidence levels. In this context, supervised learning methods used in PPI prediction require complete and reliable datasets formed by positive and negative samples. However, noninteracting pairs are rarely reported by experimentalists, owing to the difficulty of demonstrating noninteraction under all possible conditions. In fact, negative datasets have traditionally been created by randomly pairing proteins [15, 33, 34] or by selecting pairs of proteins that do not share the same subcellular compartment [10]. Nonetheless, other works suggest that negative sets created on the basis of cell location alone lead to biased estimations in predictive interaction models [17]. To solve this problem, Wu et al. [35] proposed a predictive interaction method by means of semantic similarity measures [36] based on gene ontology (GO) annotations [37], although they did not specify which ontology contributed most to the process of obtaining negative interactions. For this reason, Saeed and Deane [22] introduced a novel method to generate negative datasets based on functional data, location, expression, and homology. These authors considered a noninteracting pair to be two proteins showing no overlap in any of the features under consideration. In another work, Yu et al. [38] demonstrated that the accuracy of PPI prediction works is significantly overestimated: the accuracy reported by a prediction model strongly depends on the structure of the selected training and testing datasets. The chosen negative pairs in the training data have a variable impact on the accuracy, and the accuracy can be artificially inflated by a bias towards dominant samples in the positive data [38]. Accordingly, Yu et al. [38] also presented a method for the selection of unbiased negative examples based on the frequency of the proteins involved in positive interactions in the dataset.

In this work, a novel method is presented for constructing an SVM classifier for PPI prediction, selecting the negative dataset through a clustering approach applied to the 4 million negative pairs from Saeed and Deane [22]. This clustering approach is applied in an effort to avoid the impact of the negative dataset on the accuracy of the classifier model. The new method is based on a new feature extraction and selection procedure using well-known databases, applied specifically to yeast as a model organism, since yeast is the most widely analysed organism and the one for which data are easiest to find. New semantic similarity measures calculated from the features are proposed, and we demonstrate that their use improves the predictive power of the trained classifiers. In addition, the classifier can return a confidence score for each PPI prediction through a modification of the SVM implementation. Firstly, features are extracted for positive and negative samples; then, a clustering approach is performed in order to obtain highly reliable, representative noninteracting samples. Subsequently, a parallel filter-wrapper feature selection technique selects the most relevant extracted features in order to obtain a reliable model. The algorithm called mRMR (minimal-redundancy-maximal-relevance criterion) [39], based on the statistical concept of mutual information, is used as the filter. This reduction in the number of features allows for better training efficiency, as the search space for most of the parameters of the model is also reduced [40, 41].

In a second part, with the purpose of validating the generalisation capability of our model, a group of highly reliable external datasets from [9] was classified using our method. These datasets were extracted using computational and experimental approaches together with information from the literature. The models used are SVM classifiers built using the most relevant selected features that characterise a protein-protein interaction, as explained. They were trained using three training sets in which the positive examples were kept fixed but the negative set was changed; each negative set was obtained by a specific method: (1) the hierarchical clustering method presented in this paper, (2) random selection, and (3) the approach proposed by Yu et al. [38].

The testing datasets were filtered before assessment to prevent biased results, that is, to remove any overlap with the datasets used during the training stage. High sensitivity and specificity are obtained in both parts using the proposed approach, that is, the model trained with the negative set selected by the proposed hierarchical clustering method. The presented approach could thus serve as a guide for experimentation, providing a useful tool to save money and time.

2. Material and Methods

2.1. Material

Two types of datasets were used: training datasets to construct the models and testing datasets to assess the goodness of the predictions. A supervised learning classifier such as an SVM requires positive and negative samples as training data. The positive and negative examples were extracted from Saeed and Deane [22], where the authors provide a positive dataset composed of 4809 high-reliability interacting pairs of proteins and a high-quality negative set formed by more than 4 million noninteracting pairs. Two negative subsets of a size similar to that of the positive dataset were extracted from this negative set: one composed of randomly selected noninteracting pairs (4894) and the other created by means of the hierarchical clustering approach presented in this paper in order to select the most representative negative samples (4988). The main goal of this clustered negative dataset is to represent the whole negative space of more than 4 million examples, avoiding biased results in PPI prediction. The third negative set used in this paper was created using the method proposed by Yu et al. [38], which is "balanced" with respect to the chosen positive set. A comparison of the PPI classification results for the three models trained using these negative datasets is shown in Section 3. During the training phase, the positive dataset is called the gold standard positive (GSP) set and the negative dataset used is called the gold standard negative (GSN) set.

In the case of the testing datasets, these were selected for the sake of validating the generalisation capability of the proposed approach in PPI prediction. A group of reliable binary interaction datasets (LC-multiple, binary-GS, Uetz-screen, and Ito-core) was taken from Yu et al. [34]. These datasets were obtained using several approaches: experimentally, computationally, and by grouping datasets well known in the literature. They can be freely downloaded from the website http://interactome.dfci.harvard.edu/. Another group of negative testing datasets, described below, was also used. The proposed testing datasets are the following.
(i) LC-multiple dataset. Literature-curated interactions supported by two or more publications. There are 2855 positive interactions.
(ii) Binary-GS dataset. A binary gold standard set assembled through a computational quality reexamination that includes well-established complexes, as well as conditional interactions and well-documented direct physical interactions in the yeast proteome. There are 1253 positive interactions.
(iii) Uetz-screen. The union of the sets found by Uetz et al. in a proteome-scale all-by-all screen [24]. There are 671 positive interactions.
(iv) Ito-core. Interactions found by Ito et al. that appear three times or more [25]. There are 829 positive interactions.
(v) Random negative datasets 1 and 2. Owing to the low number of noninteracting protein data within the RRS set, negative subsets of a size similar to that of the proposed GSP were utilised. These sets, denoted random negative dataset 1 (4896 pairs) and random negative dataset 2 (4898 pairs), were also randomly selected from the Saeed and Deane negative set [22].
(vi) Negative datasets obtained using the proposed hierarchical clustering approach. The negative datasets obtained in the last step of the hierarchical clustering process were used as testing negative datasets. In total there are 9 datasets of 5000 examples (see Section 3).

For all the datasets, the feature extraction process was applied and the resulting data were normalised to the range [0, 1] before applying the proposed method. Furthermore, prior to the evaluation of our model, every testing dataset was filtered to remove interactions overlapping with the training set. In this way, overestimation of the classification accuracy is prevented both by the clustering process that selects a representative negative dataset and by this filtering step.
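For illustration, the following minimal Python sketch shows these two preprocessing steps; the function names and array layout are hypothetical stand-ins, not taken from the original MATLAB implementation.

```python
import numpy as np

def min_max_normalise(features, eps=1e-12):
    """Scale every feature column to the range [0, 1]."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / np.maximum(hi - lo, eps)

def filter_overlap(test_pairs, train_pairs):
    """Drop test pairs already present in the training set (pairs are unordered)."""
    seen = {frozenset(p) for p in train_pairs}
    return [p for p in test_pairs if frozenset(p) not in seen]
```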

2.2. Feature Extraction

The feature extraction process for the proposed datasets was applied using well-known proteomics databases, especially for the yeast model organism. The calculated features cover different types of proteomic information, integrating diverse databases: the Gene Ontology Annotation (GOA) Database [42], the MIPS Comprehensive Yeast Genome Database (MIPS CYGD) [43], the Homologous Interactions database (HINTdb) [11], the 3D Interacting Domains database (3did) [44], and SwissPfam (an annotated description of how Pfam domains map to possibly multidomain SwissProt entries) [45].

Essentially, the approach presented in this paper integrates distinct protein features to design a reliable classifier of PPIs. The importance of protein domains in predicting PPIs has already been proved [4, 13, 19], so the SwissPfam and 3did databases were included in this process. The MIPS CYGD catalogues, covering functional, complex, phenotype, protein, and subcellular compartment information about yeast, make it a very useful tool in PPI analysis [10, 11]. Likewise, GO data have been successfully applied in classification models [46], as have similarity measures supporting PPI prediction [35]. Furthermore, the "interologs" concept helps to design new approaches in proteomics such as PPI prediction, classification, and the creation of reliable PPI databases [11, 22, 28]. Therefore, the HINTdb database was included in our study.

The main step in this process is the extraction of a set of features that can be associated with every possible pair of proteins. The fundamental idea behind feature extraction here consists of computing how many common terms are shared by two proteins (a given pair) in a given database. These are our "basic" features, each calculated as the number of common terms shared by a pair of proteins in a specific database.
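As a sketch, a basic feature of this kind can be computed as below, assuming each database has been preprocessed into a mapping from protein IDs to sets of annotation terms (a hypothetical layout, for illustration only):

```python
def basic_feature(annotations, prot_a, prot_b):
    """Number of terms shared by a protein pair in one database.

    `annotations` maps a protein ID to the set of terms (GO IDs,
    MIPS catalogue entries, ...) linked to it in that database.
    """
    terms_a = annotations.get(prot_a, set())
    terms_b = annotations.get(prot_b, set())
    return len(terms_a & terms_b)
```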

Although the extraction process integrates several information sources, these features by themselves do not provide enough information to estimate whether any given pair of proteins is likely to interact [10]. Thus, to reinforce the predictive power of the classification models through a specific combination of features, two new similarity measures, called local and global, were incorporated in this process as "extended" features. These two similarity measures are defined as follows.

Given a pair of proteins (protA, protB), let $A$ be the set of all terms linked to protein protA and $B$ the set of terms linked to protein protB in a specific database. The local similarity measure for (protA, protB) is defined as
\[
\mathrm{sim}_{\mathrm{local}} = \frac{\#(A \cap B)}{\#(A \cup B)}, \tag{2.1}
\]
where $\#(A \cap B)$ is the number of common terms in the database for (protA, protB) and $\#(A \cup B)$ is the total number of terms in the union of the sets $A$ and $B$.

In a similar way, the global similarity measure is calculated as the ratio of common terms shared by a given pair (protA, protB) to the total number of terms in the database:
\[
\mathrm{sim}_{\mathrm{global}} = \frac{\#(A \cap B)}{\#C}, \tag{2.2}
\]
where $\#C$ is the total number of terms in the complete database.
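A minimal sketch of both measures, under the same hypothetical term-set representation as above (note that the local measure coincides with the Jaccard index of the two term sets):

```python
def sim_local(terms_a, terms_b):
    """Local similarity (2.1): shared terms over all terms of the pair."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0

def sim_global(terms_a, terms_b, n_terms_in_db):
    """Global similarity (2.2): shared terms over all terms in the database."""
    return len(terms_a & terms_b) / n_terms_in_db
```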

A further description of each considered database, detailing the feature calculation and extraction for a given pair of proteins, is summarised in Table 4. For the sake of clarity, the same information is also given in the following enumeration, indicating in parentheses the type of data (integer or real) and the position in the feature list.
(i) Gene Ontology Annotation (GOA) Database [42], which provides high-quality annotation of the gene ontology (GO) [37]. The GO project was developed to give a controlled vocabulary for the annotation of molecular attributes in different model organisms. These annotations are classified in GOA into three structured ontologies that describe molecular function (F), biological process (P), and cellular component (C). Each ontology is organised as a directed acyclic graph (DAG). We extract the IDs (identifiers) of the GO terms associated with each protein, calculating the common GO annotation terms between both proteins over the three ontologies (P, C, and F) (1st, integer) and their local and global similarity measures (12th, real; 13th, real). Moreover, we considered each ontology separately (4th P, integer; 5th C, integer; 6th F, integer) and their respective local (15th, 16th, and 17th, real) and global similarity measures (18th, 19th, and 20th, real).
(ii) Homologous Interactions database (HINTdb) [11], a collection of protein-protein interactions and their homologs in one or more model organisms. Homology refers to any similarity between characteristics that is due to shared ancestry. The number of homologs shared by both proteins obtained from HINTdb is the 2nd feature (integer).
(iii) MIPS Comprehensive Yeast Genome Database (MIPS CYGD) [43], which gathers information on molecular structure and functional networks in yeast. All catalogues are considered: functional, complexes, phenotype, proteins, and subcellular compartments. Considering each MIPS catalogue separately, the number of common terms (using the catalogue identifier) is calculated for both proteins (functional 7th, complexes 8th, proteins 9th, phenotypes 10th, and subcellular compartments 11th; all integer). Moreover, their local similarity measures are considered (21st, 22nd, 23rd, 24th, and 25th, real).
(iv) 3D Interacting Domains database (3did) [44], a collection of domain-domain interactions in proteins for which high-resolution three-dimensional structures are known in the Protein Data Bank (PDB) [47]. 3did exploits structural information to supply the critical molecular details necessary for a better understanding of how interactions occur. This database also provides an overview of how similar in structure interactions are between different members of the same protein family, and it stores gene-ontology-based functional annotations and interactions between yeast proteins from large-scale interaction discovery analyses. The 3rd feature (integer) is calculated as the number of common Pfam domains of both proteins, extracted from SwissPfam, that are found in the 3did database. The 3rd feature divided by the total number of Pfam domains associated with both proteins is the 14th feature (real).
(v) SwissPfam [45] from the UniProt database [48], a compilation of domain structures from SWISSPROT and TrEMBL [45] according to Pfam [49]. Pfam is a database of protein families that stores their annotations and multiple sequence alignments created using hidden Markov models (HMM). No feature is directly associated with this database, but it is used in combination with the 3did database to calculate the 3rd and 14th features.

2.3. Feature Selection: Mutual Information and mRMR Criterion

In pattern recognition theory, patterns are represented by a set of variables (features) or measures; each pattern is thus a point in an $n$-dimensional feature space. The main goal is to select features that uniquely distinguish between patterns of different classes. Normally, the optimal set of features is unknown, and the available set commonly contains irrelevant or redundant features. Through a pattern recognition process, these irrelevant or redundant features are filtered out, greatly improving the learning performance of classifiers [40, 41]. This reduction in the number of features, also known as feature selection, simplifies the model and gives a better visualisation and understanding of the data [50]. In this work, we treat PPI prediction as a classification problem, so each sample point represents a pair of proteins that must be classified into one of two possible classes: noninteracting or interacting pair.

Feature selection algorithms can be classified into two groups: filter and wrapper [50, 51]. Filter methods choose a subset of features in a preprocessing step, independently of the machine learning algorithm used. Wrapper methods use the classifier performance to assess the goodness of a feature subset. Other authors have utilised a combination of filter and wrapper algorithms [39]; in fact, such a combination is used in this work. First, a filter method is applied in order to obtain the relevance ranking of the features, and subsequently a wrapper method is performed, training support vector machine models following the obtained relevance order.

Different criteria have been applied to evaluate the goodness of a feature [50, 52]. In this case, the proposed filter feature selection method uses mutual information as the relevance measure and accounts for redundancy between features through the minimal-redundancy-maximal-relevance (mRMR) criterion proposed by Peng et al. [39].

Let $X$ and $Y$ be two continuous random variables with marginal pdfs $p(x)$ and $p(y)$, respectively, and joint probability density function (pdf) $p(x,y)$. The mutual information between $X$ and $Y$ can be represented as [50, 53]
\[
I(X,Y) = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy. \tag{2.3}
\]

In the case of discrete variables, the integral reduces to a summation. Let $X$ and $Y$ be two discrete variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, marginal probabilities $p(x)$ and $p(y)$, respectively, and joint probability mass function $p(x,y)$. The MI between $X$ and $Y$ is expressed as [50]
\[
I(X,Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}. \tag{2.4}
\]

The mutual information (MI) has two principal properties that distinguish it from other dependency measures: (1) the capacity to measure any kind of relationship between variables and (2) its invariance under space transformations [50, 54].
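For discrete data, (2.4) can be estimated directly from sample frequencies, as in the following illustrative sketch (a simple plug-in estimator in nats; continuous features would instead require density estimation, such as the Parzen windows mentioned below):

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y), cf. (2.4), for two discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    joint = {}
    for xi, yi in zip(x, y):                      # joint frequency counts
        joint[(xi, yi)] = joint.get((xi, yi), 0) + 1
    px = {v: np.mean(x == v) for v in set(x.tolist())}   # marginal of X
    py = {v: np.mean(y == v) for v in set(y.tolist())}   # marginal of Y
    mi = 0.0
    for (xi, yi), count in joint.items():
        pxy = count / n
        mi += pxy * np.log(pxy / (px[xi] * py[yi]))
    return mi
```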

For mRMR, the authors considered mutual-information-based feature selection for both discrete and continuous data [39]. The MI for continuous variables was estimated using Parzen Gaussian windows [39]. Peng et al. show that, using a first-order incremental search (in which one feature is selected at a time), the mRMR criterion is equivalent to maximum dependency, that is, to estimating the mutual information $I(C,S)$ between the class variable $C$ and the subset of selected features $S$. In Peng et al. [39], to minimise the classification error in the incremental search algorithm, the mRMR method is combined with two wrapper schemes. In a first stage, the method is used to find a candidate feature set. In a second stage, backward and forward selections are applied in order to find, within the candidate feature set, the compact feature set that minimises the classification error.

Given the class variable $C$, the initial set of features $F$, an individual feature $f_i \in F$, and a subset of selected features $S \subset F$, the mRMR criterion for the first-order incremental search can be expressed as the optimisation of the following condition [39, 50]:
\[
\max_{f_i \in F \setminus S} \left[ I\left(C; f_i\right) - \frac{1}{|S|} \sum_{f_s \in S} I\left(f_s; f_i\right) \right], \tag{2.5}
\]
where $|S|$ is the cardinality of the selected feature set $S$ and $f_s \in S$.

This filter mRMR method is fast and efficient because of its incremental nature, and it shows better feature selection and classifier accuracy when combined with a wrapper approach [39, 50].
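The first-order incremental search of (2.5) can be sketched as a greedy loop; this illustration assumes discretised features and reuses a mutual-information estimator such as the one sketched above:

```python
def mrmr_rank(features, labels, mi_fn):
    """Greedy mRMR ranking, cf. (2.5); `features` is (n_samples, n_features)."""
    n_feat = features.shape[1]
    relevance = [mi_fn(features[:, j], labels) for j in range(n_feat)]
    selected, remaining = [], list(range(n_feat))
    while remaining:
        def mrmr_score(j):
            if not selected:
                return relevance[j]           # first pick: pure relevance
            redundancy = sum(mi_fn(features[:, j], features[:, s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy  # relevance minus mean redundancy
        best = max(remaining, key=mrmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected                           # indices, most relevant first
```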

In this work, the mRMR criterion was used as the filter algorithm with the purpose of obtaining the relevance ranking of the proposed features. Subsequently, an SVM model is trained for each incremental combination of features in order of relevance: features are added one at a time, starting from the most relevant one and continuing until all 25 features are included. In total, 25 SVM models are trained, using grid search to estimate the hyperparameters. A parallel implementation of this filter-wrapper proposal was developed because of memory and computational requirements, reducing the time needed to obtain the best combination of features that minimises the classification error.
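A sketch of this wrapper stage using scikit-learn (an illustrative stand-in for the MATLAB/LIBSVM implementation actually used; the hyperparameter grid shown is hypothetical):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def wrapper_evaluate(ranking, X_train, y_train, X_test, y_test):
    """Train one RBF-SVM per incremental feature subset and score it."""
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}  # hypothetical grid
    scores = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]                    # top-k features by mRMR rank
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
        search.fit(X_train[:, cols], y_train)
        scores.append(search.score(X_test[:, cols], y_test))
    return scores                             # test accuracy per subset size
```

Each iteration of this loop is independent of the others, which is what makes the master-slave parallelisation of Section 2.6 straightforward.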

2.4. Support Vector Machine

In machine learning theory, the support vector machine (SVM) belongs to the family of supervised learning methods that analyse data and recognise patterns in regression analysis and classification problems. The SVM is a classification and regression paradigm originally invented by Vladimir Vapnik [55, 56]. It is quite popular in the literature, above all for classification and regression problems, mainly due to its good generalisation capability and its good performance [57]. Although the SVM method was originally designed for binary classification, a multiclass classification methodology was presented in Wu et al. [58]. In the case of PPI classification, it is straightforward to apply binary classification between interacting and noninteracting pairs of proteins.

For a given training set of instance-label pairs $\{\mathbf{x}_i, y_i\}$, $i = 1, \ldots, N$, with input data $\mathbf{x}_i \in \mathbb{R}^n$ and labelled output data $y_i \in \{-1, +1\}$, a support vector machine solves the following optimisation problem:
\[
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \;\; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} \xi_i, \quad \text{subject to} \;\; y_i\left(\mathbf{w}^T \phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i, \;\; \xi_i \geq 0. \tag{2.6}
\]

The training data vectors $\mathbf{x}_i$ are mapped into a higher-dimensional space through the function $\phi$. The hyperparameter $C$, called the penalty parameter of the error term, is a positive real constant that controls the amount of misclassification allowed in the model.

Taking the problem given in (2.6) into account, the dual form of the SVM can be obtained:
\[
\min_{\boldsymbol{\alpha}} \;\; \frac{1}{2} \boldsymbol{\alpha}^T Q \boldsymbol{\alpha} - \mathbf{e}^T \boldsymbol{\alpha}, \quad \text{subject to} \;\; \mathbf{y}^T \boldsymbol{\alpha} = 0, \;\; 0 \leq \alpha_i \leq C, \;\; i = 1, \ldots, N, \tag{2.7}
\]
where $\mathbf{e}$ is the all-ones vector and $Q$ is an $N \times N$ positive semidefinite matrix given by $Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$. $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ is called the kernel function and allows the SVM algorithm to fit a maximum-margin hyperplane in a transformed feature space.

The classifier is a hyperplane in the high-dimensional feature space, which may be nonlinear in the original input space. For the general nonlinear SVM classifier, the decision function can be expressed as
\[
y(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K\left(\mathbf{x}, \mathbf{x}_i\right) + b \right), \tag{2.8}
\]
where the parameters $\alpha_i$ correspond to the solution of the quadratic programming problem that solves the maximum-margin optimisation problem. The training data points with nonzero $\alpha_i$ values are called support vectors [59] because they are the ones actually required to define the separating hyperplane.

The kernel most commonly used in the literature is the radial basis function (RBF) or Gaussian kernel [60]. It is defined as
\[
K\left(\mathbf{x}, \mathbf{x}_i\right) = \exp\left(-\gamma \left\| \mathbf{x} - \mathbf{x}_i \right\|^2\right), \quad \gamma > 0, \tag{2.9}
\]
where the parameter $\gamma$ controls the region of influence of every support vector.

Training an SVM implies optimising the $\alpha_i$ as well as the so-called hyperparameters of the model. These hyperparameters are usually calculated using grid search and cross-validation [59]. In the case of the RBF kernel, the hyperparameters $C$ and $\gamma$ need to be optimised.

Furthermore, a score for PPI prediction is proposed in the present work. This score is computed as the absolute difference of the probabilities returned by the SVM model for each pair of proteins.

This score is used as a measure of confidence in PPI classification. The SVM classifies each pair, reporting two probability values that express the chance of belonging to the interacting-pair class or the noninteracting-pair class. These probabilities are obtained by particularising the multiclass classification methodology introduced by Wu et al. [58] to the problem of PPI prediction (binary classification). In the general problem, given the observation $\mathbf{x}$ and the class label $y$, it is assumed that the estimated pairwise class probabilities $\mu_{ij} = P(y = i \mid y = i \text{ or } j, \mathbf{x})$ are available. Following the setting of the one-against-one approach for the general multiclass problem with $k$ classes, the pairwise class probabilities are first estimated by
\[
r_{ij} \approx P\left(y = i \mid y = i \text{ or } j, \mathbf{x}\right) \approx \frac{1}{1 + e^{Af + B}}, \tag{2.10}
\]
where $A$ and $B$ are estimated by minimising the negative log-likelihood function using known training data and $f$ are the decision values for these training data. In Zhang et al. [61], it is recalled that SVM decision values can easily cluster at $\pm 1$, making the probability estimate in (2.10) inaccurate; therefore, ten-fold cross-validation was applied to obtain the decision values in the experimental results. The next step is to obtain the $p_i$ from these $r_{ij}$ by solving the optimisation problem presented in Wu et al. [58].
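A minimal sketch of the proposed confidence score; in scikit-learn (used here purely for illustration), Platt-style probabilities comparable to (2.10) are enabled with probability=True:

```python
import numpy as np
from sklearn.svm import SVC

def ppi_confidence(model, X_pairs):
    """Confidence per pair: |P(interacting) - P(noninteracting)|."""
    probs = model.predict_proba(X_pairs)       # shape (n_pairs, 2)
    return np.abs(probs[:, 1] - probs[:, 0])   # 0 = undecided, 1 = confident

# Usage sketch (hypothetical data):
# model = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
# scores = ppi_confidence(model, X_test)
```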

The SVM implementation was taken from the LIBSVM library [62] for Matlab (in this case R2010a). Specifically, C-SVM with the RBF kernel was used in the development of the presented work.

2.5. Clustering Methodology

A clustering approach was applied to the negative dataset proposed by Saeed and Deane [22] in order to obtain a relevant, representative, and significant negative subset for training reliable SVM models. Saeed and Deane provide more than 4 million high-quality negative pairs. After the feature extraction process applied to this large set of pairs, the data to consider form a matrix of more than 4 million pairs (rows) by 25 features (columns). Training a model on such an amount of data is not feasible, and the overrepresentation of negative data would also mask the effect of the positive samples.

In order to reduce this amount of negative samples and select the most relevant noninteracting pairs, a "hierarchical" clustering approach, which is an iterative $k$-means process, is proposed in this section. Due to memory and computational requirements, the 4 million noninteracting pairs were divided into subsets small enough to be processed by $k$-means. The $k$-means algorithm is applied to every subset, and the samples nearest to the $k$ centroids are taken as the most representative pairs of that specific subset. These representatives are then joined again, creating a number of new subsets, and the same $k$-means process is applied to each new subset in an iterative procedure, as explained below.

In the following lines, a definition of the classic $k$-means is given. In data mining, $k$-means clustering [63] is a method of cluster analysis that assigns $n$ observations to $k$ clusters, where each observation belongs to the cluster with the nearest mean. Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a $d$-dimensional real vector, the $n$ observations are assigned to $k$ sets ($k \leq n$) $S = \{S_1, S_2, \ldots, S_k\}$ minimising the within-cluster sum of squares (WCSS) [63]:
\[
\operatorname*{arg\,min}_{S} \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \left\| \mathbf{x}_j - \boldsymbol{\mu}_i \right\|^2, \tag{2.11}
\]
where $\boldsymbol{\mu}_i$ is the mean of the points in $S_i$.

Here, in the application of $k$-means, the distance measure used is the classical squared Euclidean distance, and the clustering data form a matrix whose rows represent pairs of noninteracting proteins and whose columns represent the 25 considered features. The initial cluster centroid positions are randomly chosen from the samples. Likewise, $k$ is set to 5000 because this value is similar to the size of the considered positive set (GSP) and also benefits the computational performance of this "hierarchical" clustering approach.

In practice, the 4 million set was divided into subsets of approximately 50000 pairs (49665 samples), creating 84 subsets of negative samples. This division was carried out due to the memory requirements of the available computing system, using the maximum allowed limit. The classical $k$-means clustering algorithm [63] was applied to each subset, obtaining the 5000 most representative samples, that is, reducing the data to 10% of its size. Then, new subsets of 50000 negative samples were created by joining the respective 5000-sample sets in order, and the $k$-means algorithm was applied again to the new subsets, obtaining their 5000 most representative samples. This process is repeated until the final 5000 most representative samples, a set of a size similar to that of the proposed positive set, are obtained (see Figure 1). This approach is a "hierarchical", iterative $k$-means-based clustering algorithm that can be run on a parallel computing platform (see Section 2.6), since the $k$-means runs within each iteration are independent.

More formally, as Figure 1 shows, Iteration 1 starts from an initial group of subsets of approximately 50000 pairs, $\mathbf{C} = \{C_1^1, C_2^1, \ldots, C_{84}^1\}$. As mentioned, the proposed "hierarchical" clustering approach is an iterative $k$-means process applied to each $C_j^i$, where $i$ is the iteration and $j$ is the subset index. The resulting set of the $k$-means method is called $R_j^i$, using the same indices $i$ and $j$ as the input subset $C_j^i$. Thus, $R_j^i$ is formed by the 5000 most representative negative samples of $C_j^i$ selected by $k$-means. In the following iterations, $C_j^{i+1}$ is the subset formed by the union of 10 sets of 5000 representative negative samples $R_m^i$. When it is not possible to join 10 subsets $R_m^i$ because fewer remain, the union is composed of the maximum number of remaining subsets until all considered data are covered. In general, for iteration $i$ and subset $j$,
\[
C_j^i = \bigcup_{m=(j-1)\cdot 10 + 1}^{j \cdot 10} R_m^{i-1}.
\]
In this paper, 3 iterations were executed to obtain the set of the 5000 most representative negative samples from the whole set of more than 4 million negative samples. In Iteration 2 there were 9 subsets $C_1^2, C_2^2, \ldots, C_9^2$, where $C_9^2$ contains 20000 pairs. The subsets $R_1^2, R_2^2, \ldots, R_9^2$ resulting from $k$-means create a new $C_1^3$ of 45000 elements. In the final step, $R_1^3$ is obtained in Iteration 3; it will be used in the training set as a representation of the negative space of the whole negative set. The sets $R_1^2, R_2^2, \ldots, R_9^2$ will be used as testing sets in Section 3; after a filtering process against the training set, they are called $R_{\text{test},1}^3, R_{\text{test},2}^3, \ldots, R_{\text{test},9}^3$.
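The whole reduction can be sketched as follows (an illustrative Python/scikit-learn rendering; the original ran as independent MATLAB $k$-means jobs on a cluster, and edge cases such as empty clusters are ignored here):

```python
import numpy as np
from sklearn.cluster import KMeans

def representatives(subset, k=5000):
    """R_j^i: the sample closest to each of the k centroids of C_j^i."""
    km = KMeans(n_clusters=k, n_init=1).fit(subset)
    reps = []
    for c in range(k):
        members = subset[km.labels_ == c]
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])   # nearest sample to centroid
    return np.vstack(reps)

def hierarchical_reduce(negatives, chunk=50000, k=5000):
    """Iterate until a single chunk remains, then take its k representatives."""
    data = negatives
    while len(data) > chunk:
        chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        data = np.vstack([representatives(c, k) for c in chunks])
    return representatives(data, k)              # final GSN candidates
```

With roughly 4.2 million rows, this loop reproduces the three iterations described above: 84 chunks, then 9, then a single final reduction.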

With this process, the main goal of obtaining a representative, unbiased negative dataset from a high-quality negative set is fulfilled.

2.6. Parallelism Approach

The filter/wrapper feature selection proposed in this work demands high computational resources. The classical and simple master-slave approach was adopted [64]: a master process sends tasks and data to the slave processes, receives the results from the slaves, and controls the finalisation of the tasks. In our case, each task consists of training an SVM model, including the grid search for its hyperparameters. Therefore, the master process sends the following data to the slave processes: the selected features and the training and testing datasets. In addition, the "hierarchical" $k$-means clustering algorithm from the previous section could be implemented on a parallel computing platform using this same approach.
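The following self-contained Python sketch illustrates the task-farm idea with a process pool standing in for MPI (the actual implementation used MPIMEX/MPI from MATLAB, as described below; the data and grid are toy stand-ins):

```python
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def slave(task):
    """Slave-side work: grid-search an RBF-SVM on one feature subset."""
    cols, X, y = task
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    search.fit(X[:, cols], y)
    return cols, search.best_score_

if __name__ == "__main__":
    # Toy data standing in for the 25-feature PPI matrix.
    X, y = make_classification(n_samples=500, n_features=25, random_state=0)
    ranking = list(range(25))                 # stand-in mRMR relevance order
    tasks = [(ranking[:k], X, y) for k in range(1, 26)]
    with Pool() as pool:                      # "master" farms out the 25 tasks
        for cols, acc in pool.map(slave, tasks):
            print(len(cols), "features -> CV accuracy", round(acc, 3))
```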

The implementation of this approach was carried out using MPIMEX [65], a new interface developed in our research group that allows MATLAB standalone applications to call MPI (message passing interface) standard routines. MPI is a library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementers, and users, as defined at http://www.mcs.anl.gov/research/projects/mpi/.

This parallel approach was run on a cluster of 13 nodes, each with dual Intel Xeon E5320 2.86 GHz processors, 4 GB of RAM, and a 250 GB HDD. All nodes are connected by Gigabit Ethernet, and the installed operating system is Linux CentOS 4.6 (Rocks). This cluster was purchased with public funds from the Spanish Ministry of Education Project TIN 2007-60587. The execution time needed to train all the SVM models was reduced from 16 days on a single computer to 32 hours.

3. Results and Discussion

The results consist of two parts. In the first part, a "suboptimum" set of features is selected through the filter/wrapper feature selection process using the parallel approach. The training data for the RBF-SVM model are composed of the GSP set and the GSN set resulting from the iterative clustering approach explained in Section 2. In the second part, taking this suboptimum set of features, three RBF-SVM classifiers are constructed using three training sets. All training sets share the same GSP set of positive examples. In the first case, the GSN set is the negative set obtained using the hierarchical clustering method from the first part; in the second case, the GSN set is a randomly selected negative set, as mentioned; in the third case, the GSN set was created using the approach proposed by Yu et al. [38] and is "balanced" with respect to the GSP. Subsequently, a comparison of the results obtained by the three RBF-SVM classifiers trained with the proposed negative datasets is discussed.

Prior to the filter/wrapper feature selection process, the feature extraction process was applied to all available datasets. The 25 features were also extracted for the 4 million negative set from Saeed and Deane [22], but, due to computational requirements, the whole set was divided into 84 subsets of approximately 50000 samples. In order to obtain a representative negative dataset of the whole negative space, the iterative $k$-means clustering approach was applied to these 84 subsets, as explained in Section 2.5. In total, three iterations selecting 5000 representative negative samples each were performed using the clustering approach. In the first iteration, Euclidean $k$-means was applied to the 84 subsets, creating 5000 centroids per subset, and 9 new subsets (8 subsets of 50000 and a last one of 20000 negative examples) were obtained by joining the 5000 representative negative samples selected from each previous subset. In the second iteration, $k$-means was applied again to the 9 new subsets, taking 5000 new representative negative samples from each and creating another new subset of 45000 samples (the representatives of the 9-subset union). In the third and last iteration, the final 5000 most representative negative samples, taken as the GSN set for the training data, were obtained by clustering the previous subset. The negative pairs were selected as those with the minimum Euclidean distance to the centroid of each cluster. A diagram of this process is represented in Figure 1.

In this way, the considered data (GSP and clustered GSN sets) were used in the presented parallel filter/wrapper feature selection process. Because of memory requirements in the construction of the 25 SVM models, the data were randomly divided into 70% for training the SVMs and 30% for testing the performance of the obtained models. Hence, four random divisions of the data into training/test datasets were used in this feature selection approach on a cluster of computers, as described in Section 2.6. In order to obtain the best hyperparameters for the SVM models, grid search and 10-fold cross-validation were implemented. In Figure 2, the accuracy, sensitivity, and specificity obtained following the order of feature relevance reported by the mRMR filter method are shown for all 25 SVM models. It can be observed that an excess of information may lead to overfitting, that is, the interaction information decreases when more features are added to the models, especially in the testing case. The features added last were considered by the mRMR method more irrelevant or redundant than the features in the first positions. Figure 2 also shows that the performance does not significantly improve after reaching 6 features, and it even worsens due to an excess of information, so the suboptimum selected set is composed of those 6 features: the 13th, the global similarity measure of the 1st feature (common GO terms over all ontologies); the 3rd, the number of SwissPfam domains of a pair found in 3did; the 10th, the common terms of the two proteins in the MIPS phenotype catalogue; the 8th, the common terms of the two proteins in the MIPS complexes catalogue; the 7th, the common terms of the two proteins in the MIPS functional catalogue; and the 2nd, the number of homologous proteins shared by the pair.

In the selected suboptimum set, the features concerning protein complexes, phenotypes, and functional data from the MIPS CYGD catalogues have already been used successfully and proved reliable in interaction prediction analysis [10, 35, 66-69]. Note that a global similarity measure was also included in this suboptimum set of features with the purpose of improving the performance of the classifier in PPI prediction. At the same time, domain information (3rd feature) has provided a suitable framework in PPI prediction works [4, 13]. Moreover, the 2nd feature refers to homology, whose relevance has been shown in previous publications [11, 19, 21, 22].

In order to check whether the SVM models trained with 6 features are significant, a ROC (receiver operating characteristic) curve was plotted using the confidence score presented in this work and explained in Section 2. The ROC curve shows the sensitivity values with respect to 1 - specificity values. The statistic used to measure the goodness of the classification was the widely adopted AUC (area under the curve) [70, 71], which represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Figure 3 and Table 1 show the results for the 6-feature SVM model and the 25-feature SVM model, with better performance for the SVM trained with the suboptimum set. As mentioned, this reduction in the number of features implies a significant saving in memory, computation, and other requirements, whereas using the whole feature set leads to overfitting.

In the second part, the behaviour of our approach is tested using the selected subset of the six most relevant features. Three RBF-SVM models are built with three training sets sharing the same GSP but with different GSNs. In the first case, the GSN is the negative set from the first part, created using the proposed hierarchical clustering approach (also called the clustered training dataset). In the second case, the GSN is a randomly selected negative set (called the random training dataset), and, in the last case, the GSN is a negative set "balanced" with respect to the GSP set, obtained using the approach by Yu et al. [38]. This third GSN is created by a selection of unbiased negative examples based on the frequency of the proteins in the positive set. The testing datasets, detailed in Section 2, cover both positive and negative sets and were obtained in different ways: experimentally, from the literature, and computationally. Additionally, in order to make a reliable comparison, prior to the evaluation of our models, the interactions of each testing dataset were filtered to avoid overlap with the respective training set. The new sizes of the testing datasets are shown in Table 2.

The results of these models are shown in Table 3 and Figure 4 for the positive datasets and in Figure 5 for the negative datasets. In general, the SVM model trained using the negative set generated by the proposed hierarchical clustering approach performs better than the rest of the models, that is, the models that used the randomly selected negative set and the balanced negative set. Globally, the obtained results were slightly worse on the experimental datasets than on the computational and literature datasets. The models classify the literature-extracted dataset "LC-multiple" with an accuracy between 93 and 95%. For the computationally obtained "binary-GS" dataset, the classifiers obtain an accuracy between 92 and 95%. For the experimental datasets "Uetz-screen" [24] and "Ito-core" [25], the reported accuracies are slightly lower than for the previous datasets, with ranges of 72-81% and 76-80%, respectively, for the models trained with the negative set from the clustering approach and the negative set from the random selection. In the case of the model trained using the "balanced" negative set, however, the accuracies for both datasets are about 50%. Considering the nature and complexity of the filtering in experimental data, the obtained accuracy is still satisfactory, at least for the model trained using the negative set from the clustering approach.

Regarding the different negative datasets in the training data, the model trained using the negative set extracted by the clustering method attained better results than the model trained using a randomly selected negative set. The minimum relative difference is about 28% compared to the randomly selected negative set, and the maximum difference is about 90% in the case of the model trained using the "balanced" negative set. The negative set obtained by the "hierarchical" approach provides a relevant representation of the negative search space of the large high-reliability negative set from Saeed and Deane [22]. This does not happen for the "balanced" negative set: it is "balanced" with respect to the positive side of the training data, but that is not enough to recognise arbitrary negative cases. Hence, the results of the models on the negative datasets are worse than on the positive datasets; nonetheless, given the difficulty and complexity of predicting negatives, the results are still acceptable. It can be observed that the relative difference on the positive datasets favours the model trained with the randomly selected negative set, but the difference is not large, and it may even reflect a slight overestimation: the accuracy can be artificially inflated by a bias towards dominant samples in the positive data, as Yu et al. showed [38]. With such a suboptimum set of features, an SVM model is able to classify PPIs with notably high accuracy in any positive and negative dataset.

In Patil and Nakamura [19], the authors used a Bayesian approach, previously proposed by Jansen et al. [10], with only three features for filtering high-throughput datasets of the organisms Saccharomyces cerevisiae (yeast), Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. Their model obtained a sensitivity of 89.7% and a specificity of 62.9%, but it was only capable of attaining a prediction accuracy of 56.3% for true interactions on Y2H datasets external to the model. For the two datasets called "Ito" and "Uetz" (see Table 3), the presented model trained with the negative set from the clustering method reported classification rates between 76 and 93%. In Jiang and Keating [72], a mixed framework is proposed combining high-quality data filtering with decision trees in PPI prediction, based on the annotations of all GO ontologies, achieving an accuracy in the range of 65-78%. Building on this, we incorporated that information in combination with other features to improve the generalisation of our approach. Other similarity measures have been proposed, mainly based on GO annotations, for example, the work by Wu et al. [35], which was able to detect 35% of the cellular complexes from the MIPS CYGD catalogues, or the work by Wang et al. [36] for the validation of gene expression analysis. Nevertheless, those authors did not take the cellular component ontology into account because they considered that it includes ambiguous annotations that may lead to error. In this paper, we opted to propose a set of similarity measures that generalise to a wide range of databases when building our prediction model.

4. Conclusion

In this work, a new approach to building an SVM classifier for PPI prediction has been presented. The approach comprises several notable processes: feature extraction using well-known databases, a new filter-wrapper feature selection implemented in a master-slave parallel approach, and the construction of a reliable and representative negative training dataset by means of "hierarchical" $k$-means clustering. The filter method is based on the statistical concept of mutual information through the mRMR criterion, which is reliable and fast. In addition, a confidence score is provided through a modification of the SVM model implementation. A comparison was made between a randomly selected negative dataset, a "balanced" negative set obtained using the approach of Yu et al. [38], and a negative dataset obtained using the "hierarchical" $k$-means clustering method presented in this paper; the model trained using the set produced by the clustering approach shows the best performance. This comparison also allowed us to check the generalisation capacity of the presented approach through the evaluation of previously filtered external datasets. Hence, a fair negative selection method is presented that avoids overestimation in the classification of PPIs.

As further work, a hierarchical parallel clustering using a more complex clustering algorithm could improve the performance of the classifier in obtaining a balanced negative dataset. We also consider applying this approach to other model organisms such as Homo sapiens. Finally, the parallel approach could be improved through better load balancing, further reducing the computation time of the filter/wrapper feature selection.

In summary, we conclude that by combining data from several databases, using reliable positive and clustered negative samples for training, supporting the feature extraction process with a set of widely applicable similarity measures, and using mutual-information-based feature selection together with RBF-SVM models capable of returning a confidence score, we have presented a reliable approach to the validation of protein-protein interaction datasets.

Acknowledgments

J. M. Urquiza is supported by the FPU Research Grant AP2006-01748 from the Spanish Ministry of Education. This paper has been partially supported by the Spanish CICYT Project SAF2010-20558 and by the Regional Excellence Projects P07-TIC-02768 and P09-TIC-175476.