1 Introduction

As a new data processing era of Big Data, Cloud Computing, and the Internet of Things (IoT) approaches, the amount of data being collected in databases far exceeds our ability to reduce and analyze it without automated analysis techniques, that is, data mining [25, 26, 27, 34, 36]. As the importance of data mining has grown, one of the critical issues to emerge is how to scale data mining techniques to ever larger databases [2, 24, 35]. This is particularly true for computationally intensive data mining tasks such as identifying natural clusters of instances [10, 18]. Several approaches to improving scalability have been studied at length in the literature [4, 32], including parallel mining algorithms [9, 23] and preprocessing the data by filtering out redundant or irrelevant features, thus reducing the dimensionality of the database [32]. Another approach to better scalability is to use a selection of instances from a database rather than the entire database [29, 31, 41].

Perhaps the simplest approach to instance selection is random sampling [5, 6, 21]. Numerous authors have studied this approach for specific data mining tasks such as clustering [10, 18, 37, 38], association rule discovery [35], and decision tree induction [4]. When these approaches are implemented, one of the most challenging issues is determining a sample size that improves the performance of the algorithm without sacrificing solution quality. Bounds can be developed that allow for a prediction of the sampling effort needed, but such bounds usually require knowing certain problem parameters and typically overestimate the necessary sample size [6, 14, 41]. On the other hand, too small a sample leads to bias and degraded performance. One possible solution is to use adaptive sampling [4, 6, 24].

In this paper we advocate an alternative approach that is based on a novel formulation of the clustering task as an optimization problem. We also take advantage of the fact that certain optimization techniques have been explicitly designed to account for noisy performance estimates, which are common when performance is estimated using simulation. In particular, one such method is the nested partitions method, which can be used to solve general global optimization problems [32, 39] and specifically combinatorial optimization problems with noisy performance [19]. A characteristic of this method is that wrong moves made due to noise in the performance estimates can be automatically corrected in a later move. In the scalable clustering context this means that noisy performance estimates, resulting from smaller samples of instances, may result in more steps taken by the algorithm, but any bias will be automatically corrected. This eliminates the need to determine the exact sample size, although the computational performance of the algorithm may still depend to some extent on how it is selected.

Even though the pure NP method guarantees convergence to the optimal solution, its efficiency and convergence properties can still be improved. To address this, two extensions to the pure NP method are suggested: a statistical selection method and random search methods. First, to achieve more intelligent sampling, we use Nelson and Matejcik's procedure [31]. Second, Genetic Algorithms (GAs) and the k-means algorithm are used to speed up convergence and to overcome the difficulty in the backtracking stage of the Nested Partitions algorithm. For the numerical evaluation, two different types of cancer data are used. Using these extended algorithms, we show that the computation time can be reduced by sampling the instances rather than using all of them, without affecting solution quality. We also give guidelines for the minimum number of instances that should be used.

Organization The remainder of this paper is organized as follows. In Sect. 2 we briefly review statistical selection methods and clustering techniques. In Sect. 3 we discuss the basis for the new clustering methodology, which is an optimization method called the Nested Partitions method, and its extension, the Optimized Combinatorial NP Cluster algorithm. In Sect. 4 we present numerical results on the scalability of the algorithm with respect to the instance dimension, and Sect. 5 contains concluding remarks and suggestions for future research directions.

2 Literature review

2.1 Statistical selection method and random search method

In discrete-event stochastic simulation, the best solution is the one with the maximum or minimum expected simulation result among a set of alternative solutions. Thus, Ranking and Selection (R&S) procedures are a primary concern [7]. Bechhofer first proposed the fundamentals of R&S [1]. The original indifference-zone R&S procedure [1] is single-stage and assumes unknown means and known, common variances for all results. But it does not have to be single-stage: it can be extended to multi-stage (sequential) procedures, still assuming common, known variances, by defining a user-specified number of observations. Bechhofer et al. [1] presented such methodologies, and Koenig and Law [22] extended the indifference-zone approach to a screening procedure. In contrast to the articles discussed so far, Frey and Dueck [10] presented a representative-exemplar procedure that does not require reduction to a univariate model. To allocate additional replications, the indifference-zone procedures use a least-favorable configuration, whereas the optimal computing budget allocation and Bayesian decision-theoretic methods use an average-case analysis [5, 8, 33]. All three procedures are applicable to both two-stage and sequential settings. They assume that the simulation results are independent and normally distributed with unknown mean and variance.

Inoue et al. [16] showed empirically that the two-stage procedure [1, 6] performs competitively with the sequential optimal computing budget allocation model and Bayesian decision-theoretic methods when the number of systems under consideration is small \((k<5)\). For a large number of systems \((k\ge 5)\), or when the difference between the mean output of the best system and that of the other systems varies significantly, the two-stage procedure [1, 6] is less effective at identifying the best system. Among two-stage procedures, the Bayesian decision-theoretic procedures have the best overall performance characteristics.

Recently, many articles have tried to unify the fields of R&S and multiple comparison procedures (MCPs). Multiple comparisons with the best (MCB) [31] is one of the most widely used MCPs. To apply MCB in a discrete-event simulation, the simulation runs must be independently seeded and the simulation output must be normally distributed, or averaged so that the estimators used are approximately normally distributed [31, 36, 38]. There are four R&S-MCB procedures that assume normally distributed data but do not require known or equal variances: Nelson and Matejcik's procedure (Procedure NM) [31], the two-stage procedure (Procedure B) [1, 6], Watanabe's procedure (Procedure W) [40], and the Frey and Dueck procedure (Procedure FD) [10]. Procedure B and Procedure FD are performed in the same manner, the only difference being the calculation of the sample means. Both require independence among all observations, and the total sample size depends on the sample variance of the systems: the larger the sample variance, the more replications are required. Unlike these procedures, Procedure NM requires fewer total observations by employing common random numbers, whereas Watanabe [40] used the Bonferroni correction to account for the dependence induced by common random numbers. However, Nelson and Matejcik [31] observed that the benefit gained from using Procedure W diminishes when the number of systems to be compared is large. To overcome this problem, they presented Procedure NM, which assumes that the unknown variance-covariance matrix exhibits a structure known as sphericity, implying that the variances of all paired differences across systems are equal even though the marginal variances and covariances may be unequal. The difference between Procedure W and Procedure NM is the calculation of the sample variance, which affects the total sample size for second-stage sampling. Procedure NM is superior to Procedures B and DT in terms of the total observations required to obtain the desired confidence level. The only potential drawback of Procedure NM is that the assumption of sphericity may not be satisfied.

When the feasible region is discrete, random search methods are generally used. These methods usually cannot guarantee a global optimum, and therefore they are often called heuristic methods. Three common random search methods are mentioned below. Tabu search was originally proposed by Glover [12]; it escapes local optima by using a list of prohibited solutions known as the tabu list. The commonly used diversification method is restarting from the best solution obtained so far. A drawback of tabu search is that, unless the tabu list is long, it may return to a previously visited solution. Simulated annealing (SA), introduced by Kirkpatrick et al. [20], is a random search method that is able to escape local optima using a probability function. Unlike tabu search, SA does not evaluate the entire neighborhood in every iteration. Instead, it randomly chooses only one solution from the current neighborhood and evaluates its cost. As a result, SA tends to need more iterations than tabu search to find the best solution. Another disadvantage is that it has no memory, and hence it may revisit a recent solution. There are also methods that combine tabu search and SA. Genetic algorithms (GAs) were originally developed by Holland [15]. The GA is one of the most widely known evolutionary methods and is both powerful and broadly applicable to stochastic optimization [41]. Commonly used operators include selection, reproduction, crossover, and mutation. It mimics the mechanisms of natural selection and natural genetics, in which stronger individuals are more likely to survive in a competing environment; thereby the strongest individual (the one with the best performance) survives. A minimal GA sketch is given below.
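The following is a minimal sketch of such a GA, assuming a real-valued encoding (for example, concatenated cluster-center coordinates), an even population size, and a cost function f to be minimized; the operator choices (tournament selection, one-point crossover, Gaussian mutation) and all names are illustrative assumptions rather than the specific GA used later in the paper.

```python
import random

def genetic_search(f, init_population, generations=50, p_mut=0.1):
    """Minimal GA sketch: tournament selection, one-point crossover,
    Gaussian mutation. Individuals are lists of floats; f is minimized."""
    pop = [list(ind) for ind in init_population]
    for _ in range(generations):
        # Selection: keep the better of two randomly chosen individuals.
        selected = [min(random.sample(pop, 2), key=f) for _ in pop]
        # Crossover: one-point crossover between consecutive parents.
        children = []
        for a, b in zip(selected[::2], selected[1::2]):
            cut = random.randrange(1, len(a))
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Mutation: perturb each gene with small probability.
        for child in children:
            for i in range(len(child)):
                if random.random() < p_mut:
                    child[i] += random.gauss(0.0, 0.1)
        pop = children
    return min(pop, key=f)  # the strongest individual survives
```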

2.2 Scalable clustering

Clustering has been an active area of research for several decades, and many clustering algorithms have been proposed in the literature [10, 11, 13, 28, 30]. In particular, considerable research has been devoted specifically to scalable clustering. We will start by briefly describing the various types of clustering algorithms and then mention some specific scalable methods.

Clustering algorithms can be roughly divided into two categories: hierarchical clustering and partitional clustering [18]. In hierarchical clustering all of the instances are organized into a hierarchy that describes the degree of similarity between those instances (e.g., a dendrogram). Such a representation may provide a great deal of information, but the scalability of this approach is questionable as the number of instances grows. Partitional clustering, on the other hand, simply creates one partition of the data in which each instance falls into one cluster. Thus, less information is obtained but the ability to deal with a large number of instances is improved. Examples of the partitioning approach are the classic k-means and k-medoids clustering algorithms.
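As a reference point for the partitional approach, and because a few k-means iterations are later used as a local improvement step in our algorithm, the following is a minimal sketch of the classic Lloyd-style k-means; the NumPy implementation, parameter defaults, and random initialization are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal sketch of Lloyd's k-means: assign each instance to its
    nearest center, then recompute each center as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances from every instance to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```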

There are many other characteristics of clustering algorithms that must be considered to ensure the scalability of an approach. For instance, most clustering algorithms are polythetic, which means that all features are considered simultaneously when determining the similarity of two instances. If the number of features is large, this may pose scalability problems. For this reason, monothetic clustering algorithms that consider one feature at a time have been proposed. Most clustering algorithms are also non-incremental in the sense that all of the instances are considered simultaneously. However, there are a few algorithms that are incremental, which means that they consider each instance separately. Such algorithms are particularly useful when the number of instances is large. Scalable clustering has received considerable attention in recent years, and here we mention only a few of the methods. Early on, Guha et al. [13] presented the CURE algorithm, which obtains a sample from the original database, partitions the sample into a set of partitions and clusters each partition, eliminates outliers, and then clusters the partial clusters. Finally, each data instance is labeled with the corresponding cluster.

Improved scalable versions of partitioning methods such as k-means and k-medoids have also been developed. The Clustering LARge Applications (CLARA) algorithm improves the scalability of the PAM k-medoids algorithm by applying PAM to multiple samples of the actual data and returning the best clustering [18]. Jain and Dubes [17] suggest a single-pass k-means clustering algorithm whose main idea is to use a buffer that saves points from the database in a compressed form. This approach was simplified by Farnstrom et al. [7] in an effort to reduce the overhead that otherwise might cancel out any scalability improvements.

Another way of improving scalability is distributed clustering, where instead of combining all data before clustering, the data sets are processed by affinity propagation (AP) clustering algorithms [10]. The algorithm proposed in this paper is a partitional clustering algorithm that tries to find cluster centers and uses random sampling to improve scalability. In that sense it is most similar to the AP clustering algorithm, except that it guarantees convergence to the optimal solution.

3 Hybrid clustering algorithm

3.1 Nested partitions (NP) method

The main framework of the suggested algorithm is the Nested Partitions (NP) method proposed by Shi and Ólafsson [39]. This is an optimization method that solves general global optimization problems of the following form:

$$\begin{aligned} \min _{x\in X}f(x) \end{aligned}$$
(1)

where x is a point in an n-dimensional space X and \(f:X\rightarrow R\) is a real-valued performance measure defined on this space. This performance may or may not be known deterministically. In this paper, we define X as the space of all clusters, and the function measures some quality of the clusters.

The concept of the NP method is very simple. In each step, the method partitions the feasible region into subsets according to a partitioning rule and concentrates the computational effort in those subsets that are considered promising, that is, likely to contain the best solution. The partitioning rule is problem dependent; there is no fixed rule, other than the requirement that all subsets be disjoint at each iteration of the algorithm.

We assume that at every iteration there is a region, called the most promising region, that is considered most likely to contain the best solution. This most promising region is partitioned into M regions, and the remainder of the feasible region is aggregated into one region called the surrounding region, so we have \(M+1\) disjoint subsets at each iteration. We sample from each of these \(M+1\) regions using some random sampling scheme and calculate a promising index, the average of the performance function, for each region. These promising indices are compared to determine which region has the most promising index, that is, the smallest average performance value. The subregion with the best performance becomes the next most promising region. However, if the best performance is found in the surrounding region, the algorithm backtracks, and a larger region containing the current most promising region becomes the new most promising region. We then partition and sample in a similar fashion from this new most promising region. This process is repeated until the termination criterion is satisfied. The main components of the method are:

  • Partitioning At each iteration the feasible region is partitioned into subsets by a predefined rule. This partitioning creates a tree of subsets, which is called the partitioning tree.

  • Creating feasible solutions To evaluate each of the subsets, a randomly generated sample of solutions is obtained from each subset and used to estimate the performance of each region as a whole.

  • Evaluating the promising index To select the most promising region, the promising index is calculated for each subregion.

  • Backtracking If the best solution is found in the surrounding region, the algorithm backtracks to the region that was the most promising region in the previous iteration.

This method combines adaptive global sampling with local heuristic algorithms. It uses a flexible partitioning method to divide the design space into regions; each region is evaluated individually, and the evaluation results are then aggregated to determine the region on which to concentrate the computational effort. This means that the NP method intelligently samples from the entire design space and concentrates the sampling effort through systematic partitioning of the design space. A sketch of the generic NP loop is given below.
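The following is a minimal sketch of the generic NP loop under simplifying assumptions: `partition`, `sample`, and `estimate` are problem-specific placeholders, the promising index is taken as the best estimated performance among the sampled solutions (rather than an average of several such estimates), region objects and their methods are hypothetical, and backtracking returns to the previous most promising region.

```python
def nested_partitions(root, partition, sample, estimate, max_iter=100):
    """Minimal sketch of the generic NP loop. `partition(region)` returns M
    disjoint subregions, `sample(region)` returns a list of random solutions
    from the region, and `estimate(x)` returns a (possibly noisy) performance
    value to be minimized. Region objects and their methods (`difference`,
    `is_singleton`) are hypothetical placeholders."""
    history = [root]
    current = root
    for _ in range(max_iter):
        subregions = partition(current)
        surrounding = root.difference(current)          # aggregate the rest
        candidates = subregions + ([surrounding] if surrounding else [])
        # Promising index: best estimated performance among sampled solutions.
        index = [min(estimate(x) for x in sample(r)) for r in candidates]
        best = candidates[index.index(min(index))]
        if best is surrounding:
            history.pop()                               # backtrack one level
            current = history[-1] if history else root
        else:
            history.append(best)
            current = best
        if current.is_singleton():                      # termination criterion
            break
    return current
```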

3.2 Defining clusters

We can cast the suggested algorithm in the NP format. From the point of view of this approach, we presume that the whole data set is partitioned into several clusters and that each cluster is defined by its center (every instance is assigned to the nearest center). Thus the coordinates of each cluster center are the decision variables. We denote the jth cluster center by \(x^{(j)}=(x^{(j)}_{1},x^{(j)}_{2},...,x^{(j)}_{n})\), where \(j=1,2,...,m\). The clustering problem therefore tries to locate the centers so as to optimize a certain performance measure.

In the case of clustering, defining a performance measure to be optimized is rather subjective, because there are no standard criteria for what constitutes a good cluster. However, the most common measures that can be used are maximizing similarity within a cluster (that is, maximizing homogeneity or compactness) and minimizing similarity between different clusters (that is, maximizing separability between the clusters).

A particular strength of the suggested algorithm is that it can adopt any measure of cluster performance, even a combination of measures. We define the function f as the measure of the quality of a clustering. For performance comparison, we will compare our approach to other well-known methods that focus on the within-cluster similarity or compactness. To keep the comparison fair, we use a single measure of within-cluster similarity:

$$\begin{aligned} f(x^{(1)},x^{(2)},...,x^{(m)})=\sum _{y\in \psi }\sum _{i=1}^{n}|y_i-x_i^{[y]}|^2 \end{aligned}$$
(2)

We define \(\psi \) as the space of all instances, \(y\in \psi \) as a specific instance in this space, \(x^{[y]}\) as the cluster center to which the instance is assigned, and \(|y_i-x_i^{[y]}|\) as the difference between the ith coordinate of the instance and the corresponding coordinate of that center. The objective function is thus the sum of the squared distances of all instances from their respective cluster centers. By using such a simple measure we focus on the performance of the algorithm itself, as mentioned before.
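As a concrete reading of Eq. (2), the following is a minimal NumPy sketch that assigns each instance to its nearest center and sums the squared distances; the array shapes and names are illustrative.

```python
import numpy as np

def similarity(Y, centers):
    """Within-cluster sum of squared distances, as in Eq. (2).
    Y: (num_instances, n) array of instances; centers: (m, n) array."""
    # Squared distance of every instance to every center.
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # x^[y]: the assigned center
    return d2[np.arange(len(Y)), nearest].sum()
```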

When the NP method is used, the main implementation issue is defining the partitioning rule. We handle this issue by fixing the cluster centers one feature at a time. In other words, at each level of the partitioning tree, the values of all centers are limited to a range for one feature. This defines the subsets that make up the partitioning tree. Following the generic NP method, we draw random samples from each subset and apply the k-means algorithm to those random samples to speed up convergence. The resulting improved centers are used to select the most promising region at the next iteration. This most promising region is partitioned further, the remaining regions are aggregated into the surrounding region, and so forth.

Fig. 1 Simple example for clustering using NP methodology

Figure 1 shows a simple example of applying the NP methodology to the clustering problem. The problem has two dimensions and the total number of clusters is 2. To simplify the problem, we assume that each dimension has only two values, that is, \(x_i=\{1,2\}, i=1,2\). Figures 2 and 3 show the corresponding partitioning tree, where every feature can take two different values. The objective of this problem is to find the optimal location of the 2 clusters (identified as \(C_1\) and \(C_2\)). This partitioning approach helps the scalability of the method with respect to the feature dimension: it fixes one feature at a time, repeated until all features are fixed, and is in that sense monothetic. But all features are randomly assigned values during the random sampling stage, and thus all features are used simultaneously to select subregions. This approach can therefore be thought of as having elements of both monothetic and polythetic clustering.

Fig. 2 First iteration of the example

Fig. 3 Second iteration of the example

It is also important to note that the partitioning tree imposes a structure on the space of all possible clusters and determines the effectiveness of the search through this space. Investigating effective methods for ordering the features is therefore an important topic for future research.

Figure 2 shows the initial partitioning. In the first subset, the first dimension of the two cluster centers is set to (1,1); in the second subset it is set to (1,2); and in the third subset it is set to (2,2).

Random sampling is performed after partitioning into the three subsets. For instance, every sample point of the first subset has a fixed first dimension: the first coordinate of both the first and the second cluster center is fixed at 1. The centers can be randomly assigned values from 1 or 2 for the remaining dimension. A similarity value is then calculated by Eq. (2) using the sample from each subset, as sketched below.
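The following is a minimal sketch of drawing one random set of centers from a subregion in which some leading features of each center have already been fixed by the partitioning; the function name, the argument layout, and the random choice over a small set of allowed values (matching the toy example; for continuous features one would draw uniformly from the allowed range) are illustrative assumptions.

```python
import numpy as np

def sample_centers(fixed, n_features, values, rng):
    """Draw one random set of centers from a subregion of the partitioning
    tree. `fixed` gives, per cluster, the feature values already fixed by
    the partitioning; the remaining features are assigned at random from
    the allowed `values`."""
    centers = []
    for fixed_part in fixed:
        free = rng.choice(values, size=n_features - len(fixed_part))
        centers.append(np.concatenate([np.asarray(fixed_part, float), free]))
    return np.vstack(centers)

# First subset in Fig. 2: the first feature of both C1 and C2 is fixed at 1.
rng = np.random.default_rng(0)
centers = sample_centers([[1], [1]], n_features=2, values=[1, 2], rng=rng)
```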

The promising index is calculated for each subset based on these values. After the promising indices of all subsets have been calculated, the subset with the smallest promising index, here the first subset, becomes the most promising region.

Figure 3 shows that the most promising region after the first iteration is the first subset, so the partitioning in the second iteration starts from this subset. In the same way, the second dimension can take 2 different values, and three different subsets are obtained as in the first iteration. The second iteration reaches the maximum depth because there are only two dimensions in this problem. From the second iteration on there is one more subset, called the surrounding region; in Fig. 3 it is the subset containing the centers \(C_1(1,\cdot ),C_2(2,\cdot )\). After sampling from all subsets, the most promising index is found in the second region, with the first cluster at coordinates (1,1) and the second cluster at coordinates (1,2). These coordinates are optimal because they minimize the similarity measure (2).

3.3 Optimized combinatorial cluster algorithms

As already mentioned, the NP method has two apparent drawbacks. First, there are two types of error in the estimate of each region: the sampling error due to the use of sample points in the region, and the estimation error due to the use of simulation. Second, there is no guarantee that the correct move is made in each iteration. To overcome these problems, a two-stage method that addresses both has been suggested [19, 27, 36]. Statistical selection methods make it possible to guarantee that the correct move is made, because they determine a second-stage sample size that allows different numbers of sample points in each region while simultaneously controlling the total error. Following this idea, we incorporate the procedure of Nelson and Matejcik [31] into our scheme.

One main idea of statistical selection methods is that the number of sample points obtained from each system should be proportional to the variance of the performance of that system. This is very helpful when incorporated into the NP scheme, since the surrounding region, which is expected to have higher variance than the other subregions, then receives a larger sample size.

To state the two-stage approach rigorously, let \(D_{ij}(k)\) be the ith set of random sample points selected from the region \(\sigma _j(k)\) in the kth iteration, where \(i\ge 1\) and \(j=1,2,...,M+1\). In addition, let \(N=|D_{ij}(k)|\) be the initial number of sample points (assumed constant), \(\theta \in D_{ij}(k)\) a point in that set, and \(L(\theta )\) a simulation estimate of the performance of this point. Then in the kth iteration, for every i,

$$\begin{aligned} X_{ij}(k)=\min _{\theta \in D_{ij}(k)}L(\theta ) \end{aligned}$$
(3)

is a performance estimate of the region \(\sigma _j\), which is referred to as the ith system performance of the jth system, \(i\ge 1, j=1,2,...,M+1\).

In the first stage, the two-stage ranking-and-selection method obtains \(n_0\) system estimates. It then uses that information to determine the total number \(N_j\) of system estimates needed from the jth system, that is, subregion \(\sigma _j(k)\), based on the variance. If we want to select the correct subregion with probability at least \(P^*\), this number must be chosen large enough with respect to an indifference zone of \(\epsilon >0\).

First-stage samples are obtained randomly from each region using random numbers. Given a fixed first-stage sample size \(n_0\), we can determine the sample variance S of the differences of the sample means. The final sample size for a given indifference zone \(\epsilon \) can then be computed as

$$\begin{aligned} N = \max \left\{ n_0,\ \left\lceil \left( \frac{gS}{\epsilon } \right) ^{2} \right\rceil \right\} \end{aligned}$$
(4)

Note that this requires the constant g, which depends on the initial sample size \(n_0\) and the number of regions M that are compared.
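As a quick numeric illustration of Eq. (4), the following sketch computes the second-stage sample size from an already-computed S; the values of g, S, \(n_0\), and \(\epsilon \) below are hypothetical, and g would in practice be obtained as described by Nelson and Matejcik [31].

```python
import math

def second_stage_size(n0, S, g, epsilon):
    """Second-stage sample size of Eq. (4): N = max{ n0, ceil((g*S/epsilon)^2) }.
    S is the first-stage sample standard deviation of the differences of the
    sample means and g is the constant determined by n0 and the number of
    regions compared."""
    return max(n0, math.ceil((g * S / epsilon) ** 2))

# Hypothetical values: n0 = 10, S = 12.5, g = 2.3, epsilon = 5.0
# (2.3 * 12.5 / 5.0)^2 = 33.06..., so N = max(10, 34) = 34.
N = second_stage_size(10, 12.5, 2.3, 5.0)
```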

A Genetic Algorithm (GA) is used because the GA is a well-known and effective heuristic, although it offers no guarantee of global convergence. By applying a GA search to each subregion, we obtain sample points that better represent the best performance in their region, so the next most promising region can be determined more accurately, since the GA at least finds a near-optimal solution within each region. To further improve the performance of the NP method, we also incorporate the well-known heuristic clustering algorithm, the k-means algorithm. As a result, the combined algorithm retains the benefits of all of these methods.

3.4 Three cluster algorithms

In this section we suggest three types of cluster algorithms: Algorithm NP/NM/Km, Algorithm NP/NM/Genetic, and Algorithm NP/NM/Km/Genetic. To present a detailed description, we need the notation given in Table 1:

Table 1 The notations

The squared error criterion function is used as a performance measure. Its calculation is as follows.

$$\begin{aligned} J(z)=\sum _{i=1}^{N_C}\sum _{x\in I^t_j}\left| x-z^t_j\right| ^2,\quad i=1,...,N,\ j=1,2,...,M_{\sigma (k)+1} \end{aligned}$$

Using a sample of instances, the estimate

$$\begin{aligned} \hat{L}(z)=\sum _{i=1}^{N_C}\sum _{x\in I^t_j}\left| x-z^t_j\right| ^2,\quad i=1,...,N,\ j=1,2,...,M_{\sigma (k)+1} \end{aligned}$$

is used instead of J(z).

We can now state the detailed algorithm.
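To complement the detailed listing, the following is a minimal sketch of one iteration of the NP/NM/Km variant under simplifying assumptions: `regions` and `sample_centers_from` are problem-specific placeholders, the second-stage sample size N is assumed to come from Eq. (4), the k-means refinement is not constrained to keep the fixed features inside their subregion, and `similarity` is the Eq. (2) sketch given earlier.

```python
import numpy as np

def refine_with_kmeans(X, centers, iters=5):
    """A few Lloyd steps starting from the given centers (local improvement)."""
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(len(centers))])
    return centers

def np_nm_km_iteration(X, regions, sample_centers_from, N, rng, sample_size=100):
    """One-iteration sketch of NP/NM/Km: for each subregion (including the
    surrounding region), draw random centers, refine them with k-means, and
    estimate Eq. (2) on a random subset of instances; the region with the
    smallest average estimate becomes the next most promising region."""
    scores = []
    for region in regions:
        estimates = []
        for _ in range(N):                      # N from the two-stage rule, Eq. (4)
            centers = sample_centers_from(region, rng)
            centers = refine_with_kmeans(X, centers)
            idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
            estimates.append(similarity(X[idx], centers))
        scores.append(np.mean(estimates))
    return int(np.argmin(scores))               # index of the most promising region
```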

4 Numerical results

In order to evaluate the performance of these algorithms, two different sizes of cancer data, B-type and S-type, are considered [3]. The B-type data set has 9 features and 699 instances, whereas the S-type data set has 9 features and 286 instances. By varying the number of instances, we show that the algorithm can use a random sample of instances without sacrificing solution quality, and we determine guidelines for how many instances are needed. We vary the number of instances as 100, 50, 28, 15, 7, 4.5, 1.5, 0.7, and 0.5% of the total instances. For the B-type data set, this corresponds to 699, 350, 200, 100, 50, 30, 10, 5, and 3 instances. For the S-type data set, we use 286, 143, 82, 41, 20, 12, 4, 2, and 1 instances. Figure 4 shows the numerical results for the B-type cancer data set for each replication. We cannot find a clear pattern in the similarity values in Fig. 4 (left), but we can see that using partial instances performs well in terms of computation time (right). In the NP/NM/Km algorithm, the computation time with 50% of the instances is almost 1/4 of that with 100% of the instances. For all algorithms, we find that using partial instances requires less computation time than using the full instance set. We get similar results for the S-type cancer data set in Fig. 5: computation time is saved by using partial instances.

Fig. 4 Numerical results for the B-type cancer data set (left similarity value, right computation time)

Fig. 5 Numerical results for the S-type cancer data set

Table 2 Effect of using fraction of instance space for the B-type and S-type data sets
Fig. 6 Numerical results for different fractions of instances used for the B-type cancer data set

Fig. 7 Numerical results for different fractions of instances used for the S-type cancer data set

Table 2 shows the mean and standard deviation of the similarity value and the computation time for the B-type and S-type data sets. Consider the B-type data set first. These results lead to several important observations. For every algorithm, we get the best solution in terms of solution quality when half of the instances are used, so we can reduce the computation time without degrading the solution quality by using half of the instances. For example, in the NP/NM/Km algorithm, the similarity value and computation time are 4259 and 394,780, respectively, when all instances are used, and 4208 and 101,670, respectively, when 50% of the instances are used. The computation time is cut by almost 75% with no reduction in quality. We get similar results for the other algorithms: in the NP/NM/Genetic algorithm, we save 63% of the computation time when half of the instances are used. The NP/NM/Genetic/Km algorithm gives its best solution with 28% of the instances, while still cutting the computation time by 35% relative to using all instances. Similar results are obtained for the S-type data set. For the NP/NM/Km and NP/NM/Genetic algorithms, 50% of the instances gives the best solution, and the computation time of the NP/NM/Genetic algorithm does not differ much across fractions. Unlike the other algorithms, the NP/NM/Genetic/Km algorithm gives its best solution when all instances are used. For both the B-type and S-type data sets, we obtain the same results regarding computation time: it is lowest when 0.5% of the instances are used in the NP/NM/Genetic and NP/NM/Genetic/Km algorithms, and when 15% of the instances are used in the NP/NM/Km algorithm.


Figures 6 and 7 show the average similarity value and computation time for the B-type and S-type data sets. As already noted, 50% of the instances gives the best similarity value regardless of data type and algorithm. In general, the NP/NM/Km algorithm gives the best solution and the NP/NM/Genetic/Km algorithm gives the worst; this is clearest for the B-type data set, the larger one. The larger the fraction of instances used, the better the solution obtained. Similar results hold for computation time: the NP/NM/Km algorithm is the fastest and the NP/NM/Genetic algorithm is the slowest, which is again clearest for the larger data set, where the patterns across algorithms are the same. The computation time drops sharply down to the 50% fraction and shows little further difference below the 28% fraction, which means that if we have to choose a fraction between 25 and 0%, we should choose 25%.

5 Conclusions and future research

In this paper, we suggest Optimized Combinatorial NP Cluster algorithms with random sampling for stochastic problems, which is especially relevant for large, complex data such as the B-type data set. As shown, the computation time can be cut by using a fraction of the instances rather than all of them, and this is more noticeable for the B-type data problem, the larger data set. When only half of the instances are used, the computation time is cut without affecting solution quality. In addition, the standard deviation declines, which means the computation time becomes more stable. With too few instances, however, the solution quality becomes significantly worse while the computation time goes up. Also, using the k-means algorithm rather than the Genetic algorithm gives better solutions.


For further research, these algorithms can be extended with a wider variety of statistical selection methods and random search methods, which can be expected to further improve both computation time and solution similarity.