1 Introduction

Modern lifestyles are contributing to a rapid increase in disease. Lifestyle and food habits affect health and lead to heart disease and other health problems. Data mining is one of the most challenging and active research areas in healthcare because of the high value of medical data [1]. The recent growth of data mining approaches has provided a solid platform for various healthcare applications. In healthcare, data mining plays a vital role in areas such as intrusion detection, pattern recognition, making cheaper medical treatment available to patients, disease diagnosis and finding methods of treatment [2, 3]. Artificial intelligence (AI) makes a system more responsive and enables it to reason. Machine learning, a sub-field of AI, is used to obtain better predictions [4].

Data mining also helps healthcare researchers develop effective policies and systems to prevent different types of disease; early detection of disease can reduce the risk factor. The aim of our work is to predict diseases from the trained datasets using classification algorithms. Naïve Bayes and random forest classifier models are trained on three disease datasets, namely diabetes, coronary heart disease and cancer, and the performance of each model is calculated. The over-fitting problem of a single decision tree is overcome by applying the random forest algorithm [5]. The random forest algorithm provides better prediction accuracy than the Naïve Bayes algorithm. In addition, a few sample test records for the three diseases are applied to the trained models to show whether the patients in the sample are suffering from the disease or not [6].

Artificial neural networks (ANNs) are widely used classification algorithms for medical diagnosis prediction because of their efficiency [7]. A neural network comprises neurons arranged in three kinds of layers: an input layer, a hidden layer and an output layer. The training data are given as input, and the network is trained with the back-propagation algorithm. A feed-forward neural network combined with a support vector machine (SVM) is an effective technique for cancer prediction [8]. The ANN is used to classify labeled images based on the true-positive (TP) and false-positive (FP) detection rates. The detection mechanism uses a self-organized supervised learning algorithm. The ANN approach gives promising results for the detection of micro-calcification features and for biopsy detection [9]. ANN-based methods fall into two approaches: in the first, the classifier is applied to the image data within a region of interest (ROI); in the second, the ANN learns the features from pre-processed image signals. SVM is a machine learning approach based on statistical learning theory. SVM works with the ANN to map the input space to a higher-dimensional space in which the labeled images can be separated. The labeled images are separated by a maximum-margin hyperplane, which reduces the generalization error [10].

A hybrid classifier is proposed here by hybridizing the support vector machine and the artificial neural network [7]. A typical ANN consists of one input layer, one or more hidden layers and one output layer, as in Fig. 1. Each layer has several neurons, and the neurons in one layer are connected to the neurons in the adjacent layer, each connection having its own weight [11]. The figure shows an artificial neural network model with one input layer, one hidden layer and one output layer. Each neuron represents a node in the network. The input features are fed to the neurons in the input layer.

Fig. 1 Block diagram for proposed method

SVM is a supervised learning model that can perform well even with small data samples [12, 13]. The SVM classifier is less affected by the curse of dimensionality, since it can handle sparse data in high-dimensional datasets [14]. The SVM classifier also generalizes better than the ANN and avoids the local minima problem. The idea behind the SVM method is to create the optimal separating plane under the linearly separable condition [12]. The hyperplane is optimized by maximizing the margin. The margin is the distance between the boundary and the nearest point of each class. The points nearest to the boundary are called support vectors [15]. For healthcare data analysis, data mining techniques such as classification, association rules and clustering are commonly used.

The main contributions of this paper are summarized as follows.

  • Initially, a disease dataset is taken as the input to the system. Diabetes, heart disease and cancer datasets are used for the analysis; they contain a large amount of information on patient health care as well as general data. These are life-threatening diseases for human beings.

  • Data preprocessing is applied to the input datasets to remove unwanted information before further analysis. Missing values and correlations between columns are checked, and the data are then split into 70% training data and 30% test data for efficient analysis.

  • Data mining algorithms, namely random forest and Gaussian Naïve Bayes, are applied to the input disease datasets and their performance is estimated. The classification results are compared with existing results and show an improvement.

The paper is organized as follows. Section 1 gives the introduction, Sect. 2 presents the literature survey, Sect. 3 describes the proposed method, Sect. 4 presents the results and discussion, Sect. 5 presents the performance analysis and Sect. 6 concludes the paper.

2 Literature survey

Data mining is a growing field that transforms raw data into useful information. It helps authorized persons make informed choices and take the right decisions [16]. It is used to understand, predict and guide future behavior based on hidden patterns in large datasets. It offers tools for automated learning from historical data and for developing models that predict the outcomes of future scenarios. Various data mining and machine learning tools, such as regression, decision trees and Bayesian networks, are used to identify and predict diseases [17]. Diagnosing a disease requires different test results for a particular patient in a variety of scenarios. Applying data mining to the data analysis reduces the number of tests required. It therefore plays a vital role in improving performance and saving time [18].

A variety of classification and clustering algorithms play a significant role in the prediction and diagnosis of different types of disease. The Bayesian network classifier and the random forest classifier are used to diagnose the risk of diabetes [10, 19, 20]. The prediction accuracy of the k-means algorithm is enhanced by using both class and cluster methods and adapting it to different datasets [21]. A group of classification algorithms, including the random forest algorithm, is applied to diabetes data to diagnose the risk. Comparing the performance of each method shows that random forest performed well in terms of both accuracy and ROC curve [8, 22, 23].

In the ANN hybrid classifier, each neuron of the hidden and output layers receives the signals from the previous layer multiplied by the weights of the interconnections. The neuron then produces its output by passing the summed signal through the defined transfer function.

The network is trained iteratively on the given input. In each iteration, the mean square error (MSE) between the target and the achieved output is calculated. The error for the jth iteration is defined in Eq. (1) as follows,

$$ F\left( X \right) = \left( {t_{j} - a_{j} } \right)^{2} $$
(1)

where t_j and a_j are the targeted and the achieved output, respectively. The network is trained by adjusting the weights and the biases so that the MSE is minimized. The MSE estimates the posterior probability function for the classification problem. In the gradient descent method, back-propagation (BP) uses the calculated MSE at each layer to adjust the values of the interconnection weights. Although ANNs are good classifiers, they require a large number of training samples to behave properly. For this reason, it is useful to hybridize them with classifiers that require fewer training samples to classify properly. Neural network-based cancer classifiers are used with binary and multi-class problems to identify cancerous samples [8].
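As a small illustration of the error measure in Eq. (1), the following Python sketch computes the squared error for each sample and averages it over a batch. The function name and the sample values are illustrative assumptions and are not taken from the experiments in this paper.

```python
def mean_squared_error(targets, outputs):
    """Mean of (t - a)^2 over paired target and achieved outputs."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    squared = [(t - a) ** 2 for t, a in zip(targets, outputs)]
    return sum(squared) / len(squared)


# Hypothetical targets and network outputs for four samples.
targets = [1.0, 0.0, 1.0, 0.0]
outputs = [0.5, 0.5, 1.0, 0.0]
mse = mean_squared_error(targets, outputs)
print(mse)
```

During training, a gradient descent step would change the weights in the direction that lowers this value.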

The Naïve Bayes algorithm has been applied to develop an artificial intelligence system that, based on the comparison of certain parameters, predicts whether a person has diabetes or not [2, 3]. Artificial intelligence-based methods have become very effective and popular in recent years [24]. For the diagnosis of diabetes and the prediction of cancer, the adaptive neuro-fuzzy inference system shows better accuracy [12]. The same work also reports the accuracy of the Naïve Bayes classifier and the K-means algorithm. An accuracy of 80% is obtained from this method [12]. A modified extreme learning machine and a back-propagation neural network are used for the prediction of diabetes mellitus [13]. Data mining techniques such as the K-means algorithm, the MAFIA algorithm, the decision tree algorithm and other classification algorithms provide reliable performance in diagnosing heart disease [7, 16, 25, 26]. Such a system helps a non-specialized doctor to make the right decision about the heart disease risk level by generating original rules, pruned rules, classified rules and sorted rules [27].

3 Proposed method

The proposed method is implemented with the Anaconda tool (AEN 4.1 version) for data analysis. Anaconda is a package management system that manages the package versions for predictive analysis and data management [28]. Patient data for three diseases, namely diabetes, coronary heart disease and breast cancer, are taken as input. These datasets were chosen because the diseases are life-threatening, because they allow the efficiency of the proposed method to be assessed, and because the datasets are related to one another. The data are loaded and checked for missing values. Any missing values found are replaced with a null value. Each column is then checked for correlation with every other column. If two columns are found to be correlated, one of them is removed. Any true and false values found in the data are replaced with 1 and 0, respectively. The original data are split into training data, comprising 70% of the original data, and test data, comprising the remaining 30%.
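The preprocessing steps described above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the records are a list of dictionaries, a missing value is represented by `None`, and the helper names `count_missing`, `encode_booleans` and `split_70_30` are hypothetical and do not come from the original implementation.

```python
import random


def count_missing(rows):
    """Count the cells whose value is None (treated here as a null value)."""
    return sum(value is None for row in rows for value in row.values())


def encode_booleans(rows):
    """Replace True/False values with 1/0, leaving other values unchanged."""
    return [
        {key: (int(value) if isinstance(value, bool) else value)
         for key, value in row.items()}
        for row in rows
    ]


def split_70_30(rows, seed=0):
    """Shuffle the rows and return (training, test) with a 70/30 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy records; the real datasets have many more columns.
records = [{"age": 30 + i, "smoker": i % 2 == 0} for i in range(10)]
records[3]["age"] = None  # simulate one missing value
clean = encode_booleans(records)
train, test = split_70_30(clean)
print(count_missing(clean), len(train), len(test))
```

A fixed seed is used only so that the split is reproducible; the paper does not state how the rows were shuffled.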

After the number of true and false cases in the original, training and test data of the three disease datasets has been checked, each dataset is used to train the Naïve Bayes algorithm, and the accuracy that the algorithm achieves on each dataset is calculated separately using the confusion matrix [29]. The block diagram of the proposed method is shown in Fig. 1. The performance report gives the performance metrics of the accuracy calculation for each dataset individually. Similarly, the three datasets are used to train the random forest algorithm, and the accuracy it achieves on each dataset is calculated separately using the confusion matrix. The performance report again gives the performance metrics for each dataset individually. The internal model parameters are updated in every epoch, and each epoch contains one or more batches. Epochs can be repeated until the error on the datasets is minimized [30, 31].

A few sample test records are taken separately for each dataset. Applying these samples to each trained model for the corresponding disease shows whether the records are identified as having that disease or not. Comparing the results of both models for each dataset shows that the model trained with random forest gives the more accurate classification results. To assess the efficiency of the proposed method, the proposed algorithms are evaluated separately on the training data and on the test data. The proposed method is also applicable to classifying real-time disease data and identifying whether a patient is affected by a particular disease or not.

3.1 Proposed algorithm

3.1.1 Naive Bayes classification algorithm

Here, Bayes' theorem is used for classification under the assumption that the predictors are independent. That is, the Naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

The Naïve Bayes model is easy to build and is suitable for very large datasets. Although the model is very simple, it performs well even in complicated scenarios. Using Bayes' theorem, the posterior probability is calculated with the equation below:

$$ P\left( {a|y} \right) = \frac{{P\left( {y|a} \right)P\left( a \right)}}{{P\left( y \right)}} $$
(2)

where P(a|y) indicates the posterior probability of the class, P(a) represents the class prior probability, P(y|a) is the likelihood, i.e., the probability of the predictor given the class, and P(y) indicates the predictor’s prior probability.
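A short worked example may make Eq. (2) more concrete. The sketch below evaluates the posterior for a two-class problem; all probabilities are hypothetical numbers chosen for illustration, and P(y) is obtained from the law of total probability over the two classes.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: posterior = likelihood * prior / evidence."""
    return likelihood * prior / evidence


# Hypothetical numbers for a two-class problem (disease / no disease).
p_disease = 0.3            # P(a): class prior
p_y_given_disease = 0.8    # P(y|a): likelihood of the observation
p_y_given_healthy = 0.2    # likelihood of the observation in the other class

# P(y) from the law of total probability over both classes.
p_y = p_y_given_disease * p_disease + p_y_given_healthy * (1 - p_disease)

p_disease_given_y = posterior(p_disease, p_y_given_disease, p_y)
print(round(p_disease_given_y, 4))
```

With several predictors, the Naïve Bayes assumption lets the likelihood be written as the product of the per-feature likelihoods.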

3.1.2 Random forest algorithm

The random forest (RF) is a hierarchical collection of tree-structured base classifiers. Text data usually have a large number of dimensions, and the dataset contains many irrelevant attributes. Only a few important attributes are informative for the classifier model. The RF algorithm uses a simple predetermined probability to select the most relevant attributes. Breiman formulated the RF algorithm, which draws sample data subsets and constructs multiple decision trees by mapping random samples of feature subspaces. The RF algorithm for a set of training documents D and Nf features can be described as follows:

  1. Initial: D1, D2, …, DK are sampled by predetermined probability with replacement.

  2. For each document DK, construct a decision tree model. The training documents are randomly sampled using a subspace of m-try dimensions from the available features. Calculate all possible probabilities based on the m-try features. The leaf node produces the best data split. The process continues until it reaches the saturation criterion.

Combine the K unpruned trees h1(X1), h2(X2), … into a random forest ensemble and use the highest probability value for the classification decision.

Random forest pseudocode:

  1. Randomly select “n” features from the total “k” features, where n ≪ k.

  2. Among the “n” features, calculate the node using the best split point.

  3. Split the node into daughter nodes using the best split.

  4. Repeat steps 1 to 3 until “l” number of nodes has been reached.

  5. Build the forest by repeating steps 1 to 4 to create the required number of trees.
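The pseudocode above can be illustrated with a deliberately simplified Python sketch. It is not the implementation used in this work: each tree is reduced to a one-level decision stump, the split is chosen by counting misclassifications, and the names `fit_stump`, `fit_forest` and `predict_forest` as well as the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Pick the (feature, threshold) among `features` with the fewest errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(v != l_lab for v in left)
                      + sum(v != r_lab for v in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        label = majority(y)
        return (features[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, l_lab, r_lab = stump
    return l_lab if row[f] <= threshold else r_lab


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        feats = rng.sample(range(len(X[0])), n_features)      # random subspace
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all stumps."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical toy data: two features, class 1 for the larger values.
X = [[1, 10], [2, 20], [3, 30], [7, 70], [8, 80], [9, 90]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0, 0]), predict_forest(forest, [10, 100]))
```

A full random forest would grow deeper trees and usually choose splits by an impurity measure such as the Gini index; the stump keeps the example short while preserving the bootstrap sampling, the random feature subset and the voting step.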

4 Results and discussion

4.1 Dataset

Here, multiple disease datasets are used: diabetes, coronary heart disease and breast cancer. The datasets were collected using wearable devices and prediction data.

4.1.1 Diabetes dataset

This dataset originates from the NIDDK. All patients in it are females of Pima Indian heritage who are at least 21 years old.

  • Pregnancies: number of times pregnant

  • Glucose: plasma glucose concentration at 2 h in an oral glucose tolerance test

  • BP: diastolic BP (mm Hg)

  • Skin thickness: triceps skin fold thickness (mm)

  • Insulin: 2-h serum insulin (mu U/ml)

  • BMI: body mass index, weight in kg/(height in m)^2

  • Diabetes: diabetes pedigree function

  • Age: in years

  • Outcome: class variable (0 or 1)

4.1.2 Coronary heart disease

This dataset comes from the Framingham heart study and includes several demographic risk factors:

  • Age: current age of the patient

The dataset also includes behavioral risk factors associated with smoking:

  • Smoking nature: whether the patient is a current smoker or not

Medical history risk factors:

  • BPMeds: whether the patient was on blood pressure medication or not.

  • prevalentStroke: whether the patient had a stroke previously or not.

  • prevalentHyp: hypertensive or not.

  • Diabetes: patient has diabetes or not.

Risk factors from the first physical examination of the patient.

  • Cholrange: total cholesterol level.

  • BPs: systolic blood pressure.

  • diBl: diastolic blood pressure.

  • BMI: body mass index.

  • HR: heart rate.

  • GL: glucose level.

  • CHDRISK: risk of coronary heart disease (CHD).

4.1.3 Breast cancer

Here, the dataset used is the breast cancer Wisconsin dataset. The attributes of this dataset are:

  • Regn ID

  • Diagnosis (1 = true, 0 = false) and various data

4.1.4 Filling in missing values

A missing value in any data means that the data were not available or not applicable, or that the event did not happen. Here, the missing values are replaced with null values.

4.1.5 Correlation coefficient

Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. It is the statistical measure of the linear relationship between a dependent variable and an independent variable. It is represented by a lowercase letter ‘r’.

Here, the correlation between all the columns of the datasets is calculated to measure their relationship (Table 1). The results give the correlation value of each column in a dataset against every other column in that dataset. If two columns in a dataset are highly correlated with each other, one of them is removed to avoid redundant information.

$$ r = \frac{{m\left( {\sum {ab} } \right) - \left( {\sum a } \right)\left( {\sum b } \right)}}{{\sqrt {\left[ {m\sum {a^{2} } - \left( {\sum a } \right)^{2} } \right]\left[ {m\sum {b^{2} } - \left( {\sum b } \right)^{2} } \right]} }} $$
(3)
Table 1 Diabetes data

Here, m represents the number of data points, ∑a is the sum of the first variable's values, ∑b is the sum of the second variable's values, ∑ab is the sum of the products of the first and second variable values, ∑a^2 is the sum of the squared values of the first variable, and ∑b^2 is the sum of the squared values of the second variable; (∑a)^2 and (∑b)^2 are the squares of the respective sums.
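The correlation formula in Eq. (3) translates directly into code. The following sketch is an illustration only; the function name `pearson_r` and the sample columns are assumptions, and each sum in the code corresponds to one of the terms defined above.

```python
import math


def pearson_r(a, b):
    """Correlation coefficient r of two equally long numeric columns."""
    m = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_a2 = sum(x * x for x in a)
    sum_b2 = sum(y * y for y in b)
    numerator = m * sum_ab - sum_a * sum_b
    denominator = math.sqrt(
        (m * sum_a2 - sum_a ** 2) * (m * sum_b2 - sum_b ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when a column is constant")
    return numerator / denominator


# Hypothetical columns: b rises with a, c falls as a rises.
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
c = [8, 6, 4, 2]
print(pearson_r(a, b), pearson_r(a, c))
```

Two columns whose coefficient is close to 1 or −1 carry nearly the same information, which is the reason one of them can be dropped during preprocessing.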

In Fig. 2, positive coefficients are indicated in blue and negative coefficients in red. The color intensity is proportional to the magnitude of the correlation coefficient.

Fig. 2 The legend colors show the correlation coefficients and their corresponding colors (color figure online)

The correlogram represents the correlations for all pairs of variables. Blue represents positive correlations, and red represents negative correlations. The color intensity is proportional to the correlation coefficient (Table 2).

Table 2 Heart disease data

Finally, a negative correlation between two variables implies that the variables under consideration change in opposite directions, i.e., if one variable increases, the other decreases, and vice versa.

The correlation coefficient ranges from −1 to 1. A value outside this range indicates an error in the system. A correlation value of −1 represents a perfect negative correlation, and a value of 1 represents a perfect positive correlation. wt, vs, qsec, mpg, hp, gear, drat, disp, cyl, carb and am represent the different parameters.

4.2 Confusion matrix

The confusion matrix summarizes the prediction results of a classification problem. The number of correct predictions for a class goes into the expected row for that class value and the predicted column for the same class value. In the same manner, the number of incorrect predictions for a class goes into the expected row for that class value and the predicted column for the class value that was predicted.

A confusion matrix contains information about the actual and predicted classifications produced by a classification process, as shown in Fig. 3. The performance is evaluated using the data in the matrix (Table 3). The confusion matrix for a two-class classifier is shown in the table.

  • Always positive (AP): positive cases that the classification model correctly labels as positive, i.e., the true positives (TP).

  • Always negative (AN): negative cases that the classifier correctly labels as negative, i.e., the true negatives (TN).

  • Always least positive (ALP): negative cases that the classification model incorrectly labels as positive, i.e., the false positives (FP).

  • Partial least negative (PLN): positive cases that are incorrectly predicted as negative, i.e., the false negatives (FN).
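The four outcomes listed above (AP, AN, ALP and PLN) can be counted with a few lines of Python. This is a sketch for illustration; the function name and the label lists are hypothetical, and the labels are assumed to be encoded as 1 (disease) and 0 (no disease).

```python
def confusion_counts(actual, predicted):
    """Count the four outcomes of a two-class classifier (labels 1 and 0)."""
    counts = {"AP": 0, "AN": 0, "ALP": 0, "PLN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["AP"] += 1    # correctly predicted positive
        elif a == 0 and p == 0:
            counts["AN"] += 1    # correctly predicted negative
        elif a == 0 and p == 1:
            counts["ALP"] += 1   # negative case predicted as positive
        else:
            counts["PLN"] += 1   # positive case predicted as negative
    return counts


# Hypothetical actual and predicted labels for eight patients.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always add up to the number of test cases, which is a convenient sanity check when building the matrix by hand.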

Fig. 3 Confusion matrix

Table 3 Cancer data

4.2.1 Accuracy calculation

The prediction accuracy is calculated using the formula

$$ {\text{Accuracy}} = \frac{{{\text{AP}} + {\text{AN}}}}{{M + N}} $$
(4)

where M = AP + PLN is the number of actual positive cases and N = ALP + AN is the number of actual negative cases, so that M + N is the total number of cases.

4.2.1.1 Precision (positive predictive value)

Precision (PREC) is the number of correct positive predictions divided by the total number of positive predictions; it indicates how many of the items labeled as positive are labeled correctly. The best precision is 1.0, whereas the worst is 0.0.

$$ {\text{PREC}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$
(5)
4.2.1.2 Recall

Sensitivity (SN) is calculated as the number of correct positive predictions divided by the total number of actual positives (P). It is also called recall (REC) or true positive rate (TPR). The best value is 1.0 and the worst value is 0.0.

$$ {\text{SN}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} = \frac{{{\text{TP}}}}{{\text{P}}} $$
(6)
4.2.1.3 F1-score

F1-score is the harmonic mean of precision and recall.

$$ F1 = 2{\text{TP}}/\left( {2{\text{TP}} + {\text{FP}} + {\text{FN}}} \right) $$
(7)
4.2.1.4 Support

Support is the number of occurrences of the true and false values in the actual class.
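To show how the measures in this section relate to each other, the sketch below derives accuracy, precision, recall, F1-score and support from the four confusion matrix counts. The function name and the counts are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and support from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "support_positive": tp + fn,   # actual positive cases
        "support_negative": tn + fp,   # actual negative cases
    }


# Hypothetical counts for 100 test cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(name, round(value, 4))
```

The F1 expression in terms of counts gives the same value as the harmonic mean 2 · PREC · SN / (PREC + SN), which can be checked numerically with the values above.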

For the diabetes data, the Naïve Bayes algorithm gives accuracies of 76.72% and 74.46% for training and test data, respectively. The random forest algorithm gives 98.88% and 74.03% for training and test data, respectively.

For the heart disease data, the Naïve Bayes algorithm gives accuracies of 82.44% and 82.35% for training and test data, respectively. The random forest algorithm gives 97.96% and 83.85% for training and test data, respectively.

For the cancer data, the Naïve Bayes algorithm gives accuracies of 62.06% and 63.74% for training and test data, respectively. The random forest algorithm gives 99.50% and 92.40% for training and test data, respectively. The accuracies are shown in Figs. 4, 5 and 6.

Fig. 4 Diabetes train and test data accuracy for two algorithms

Fig. 5 Heart disease train and test data accuracy for two algorithms

When sample test data for each disease are applied to its trained models, the model trained with the random forest algorithm gives more accurate results than the model trained with the Naïve Bayes algorithm. The main reason for the performance of the random forest algorithm on test data is its ability to handle high-dimensional features and to judge for itself which features in the dataset are essential. The random forest algorithm also recognizes feature interactions.

The results of the diagnosis are given as 1s and 0s, where 1 indicates that the sample test case has been diagnosed as having the disease. The model trained with the random forest algorithm gives accurate results. The results for the training data and the testing data differ only because the number of records tested differs between the two cases [29].

5 Performance analysis

The calculated accuracy on the testing and training data is compared with that of the K-means clustering algorithm and DBSCAN (density-based spatial clustering of applications with noise) to assess the effectiveness of the proposed algorithms on the diabetes, coronary heart disease and cancer datasets. The K-means clustering algorithm is a machine learning algorithm that estimates the centroid of each cluster from the mean of the points assigned to it. Some authors have used the same algorithm for analysis, and we compare their K-means results with our proposal; the results are shown in the figures (Fig. 6). The DBSCAN algorithm groups points that lie in dense regions into clusters and treats points in low-density regions as noise. The DBSCAN results reported in previous work are used for our comparison. The same benchmark datasets are used for the analysis.
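The centroid update of K-means mentioned above can be sketched for one-dimensional data as follows. This is only an illustration of the general technique: the function name, the initial centroids and the data are assumptions, and the compared K-means results in this section are taken from earlier work rather than produced by this code.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign the point to the nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its points (keep it if empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


# Hypothetical measurements that form two groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids)
```

Because K-means is unsupervised, its clusters have to be mapped to class labels before an accuracy can be computed and compared with a classifier.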

Fig. 6 Cancer train and test data accuracy for two algorithms

Figure 7 shows the test data accuracy of the Naïve Bayes algorithm, the K-means clustering algorithm and the random forest algorithm. On the test data, the K-means algorithm performs somewhat better on the heart disease and cancer datasets. However, the proposed random forest algorithm performs far better than K-means clustering.

Fig. 7 Test data compared with K-Means clustering algorithm

Figure 8 shows the accuracy of the Naïve Bayes algorithm, the DBSCAN clustering algorithm and the random forest algorithm. On the test data, the DBSCAN algorithm performs somewhat better on the cancer dataset. However, the proposed random forest algorithm performs better than DBSCAN clustering.

Fig. 8 Test data compared with DBSCAN Algorithm

Figure 9 shows the accuracy of the Naïve Bayes algorithm, the K-means algorithm and the random forest algorithm. On the training data, the K-means algorithm performs somewhat better than the Naïve Bayes algorithm on the heart disease and cancer datasets. However, the proposed random forest algorithm performs better than K-means clustering.

Fig. 9 Training data compared with K-Means clustering algorithm

Figure 10 shows the accuracy of the Naïve Bayes algorithm, the DBSCAN algorithm and the random forest algorithm. On the training data, the DBSCAN algorithm performs somewhat better than the Naïve Bayes algorithm on the heart disease and cancer datasets. However, the proposed random forest algorithm performs better than DBSCAN clustering.

Fig. 10 Training data compared with DBSCAN Algorithm

6 Conclusion

Data mining can be effectively applied in the medical domain. The aim of this study is to find a model for the diagnosis of diabetes, coronary heart disease and cancer from the available datasets. The datasets are chosen from online repositories. The pre-processing techniques applied are filling in missing values and removing correlated columns. Next, the classifiers are applied to the preprocessed datasets, and Bayesian and random forest models are constructed. Finally, the accuracy of the models is calculated and analyzed. The Bayesian classification network shows test accuracies of 74.46, 82.35 and 63.74% for the diabetes, coronary heart disease and cancer data, respectively. Similarly, the random forest model shows test accuracies of 74.03, 83.85 and 92.40%. The accuracy of the random forest model is higher than that of the Naïve Bayes classifier for coronary heart disease and cancer, and comparable for diabetes. Finally, the proposed algorithms are compared against the K-means clustering algorithm and the DBSCAN algorithm to assess their effectiveness, and the result graphs show that the random forest algorithm works well compared with the other two algorithms. When classification is performed with the trained models on sample test data for each disease, the random forest model gives accurate results. The proposed model works well on training data and test data, and it is expected to provide good results for real-time data.

Our proposed methodology helps to improve the accuracy of diagnosis and is greatly helpful for further treatment. As future enhancements, the accuracy has to be tested with different datasets, and other AI algorithms should be applied and their accuracy checked. The limitation of the proposed model is its processing time, because of the large amount of data used to estimate performance on the training data. In future, the same algorithms will be implemented with real-time data to estimate the effectiveness of the system.