Abstract

The integration of machine learning techniques and metaheuristic algorithms is an area of interest due to its great potential for applications. In particular, using these hybrid techniques to solve combinatorial optimization problems (COPs) while improving the quality of the solutions and convergence times is of great interest in operations research. In this article, the db-scan unsupervised learning technique is explored with the goal of using it in the binarization process of continuous swarm intelligence metaheuristic algorithms. The contribution of the db-scan operator to the binarization process is analyzed systematically through the design of random operators. Additionally, the behavior of this algorithm is studied and compared with other binarization methods based on clusters and transfer functions (TFs). To verify the results, the well-known set covering problem is addressed, and a real-world problem is solved. The results show that the integration of the db-scan technique produces consistently better results in terms of computation time and solution quality than TFs and random operators. Furthermore, when compared with other clustering techniques, it achieves significantly improved convergence times.

1. Introduction

In recent years, different optimization methods based on evolutionary concepts have been explored. These methods have been used to solve complex problems, obtaining interesting levels of performance [1–3], and many such methods have been inspired by concepts extracted from abstractions of natural or social phenomena. These abstractions can be interpreted as search strategies from an optimization perspective [4]. These algorithms are inspired, for example, by the collective behavior of birds, e.g., the cuckoo search algorithm [5]; the movement of fish, e.g., the artificial fish swarm algorithm (AFSA) [6]; particle movement, e.g., particle swarm optimization (PSO) [7]; the social interactions of bees, e.g., the artificial bee colony algorithm (ABC) [8]; the process of musical creation, e.g., harmony search (HS) [9]; and the process of natural selection, e.g., genetic algorithms (GA) [10], among others. Many of these algorithms work naturally in continuous spaces.

On the other hand, lines of research that yield robust algorithms for solving combinatorial optimization problems (COPs) are of great interest in computer science and operations research. This interest is currently related mainly to decision making in complex systems. Many of these decisions require the evaluation of a very large combination of elements, in addition to solving a COP to find a feasible and satisfactory result. Examples of COPs can be found in logistics, finance, transportation, biology, and many other areas. Depending on the definition of the problem, many COPs can be classified as NP-hard. One common way to address such problems is to simplify the model so that small- and medium-sized instances can be solved using exact techniques; large problems are usually addressed by heuristic or metaheuristic algorithms.

The idea of hybridizing metaheuristic techniques with methods from other areas aims to obtain algorithms that are more robust in terms of solution quality and convergence times. State-of-the-art proposals for hybridization mainly include the following: (i) matheuristics, which combine heuristics or metaheuristics with mathematical programming [11]; (ii) hybrid heuristics, which correspond to the integration of different heuristic or metaheuristic methods [12]; (iii) simheuristics, which combine simulation and metaheuristics [13]; and (iv) the hybridization of metaheuristics and machine learning [14]. Machine learning can be considered a set of algorithms that enable the identification of significant, potentially useful information and learning through the use of data. In this work, useful information will be obtained from the data generated by a continuous metaheuristic algorithm through the use of the db-scan unsupervised learning technique, yielding robust binarizations of this algorithm.

Within the lines of this discussion, we aim to provide the following contributions:

(i) A novel machine learning binarization algorithm is proposed to allow metaheuristics commonly defined and used in continuous optimization to efficiently address COPs. This algorithm uses the db-scan unsupervised learning technique to perform the binarization process. The selected metaheuristics are particle swarm optimization (PSO) and cuckoo search (CS). Their selection is based on the fact that they are commonly used in continuous optimization and that established methods exist for adjusting their parameters in continuous spaces.

(ii) These hybrid metaheuristics are applied to the well-known set covering problem (SCP). This problem has been studied extensively in the literature, and therefore, there are known instances with which we can clearly evaluate the contribution of the db-scan binarization operator. Moreover, the SCP has numerous practical real-world applications such as vehicle routing, railways, airline crew scheduling, microbial communities, and pattern finding [15–18].

(iii) Random operators are designed to study the contribution of the db-scan algorithm to the binarization process. Additionally, the behavior of db-scan binarization is studied by comparing it with binarization methods that use k-means and transfer functions (TFs). Finally, the binarizations obtained by db-scan are used to solve a real-world problem.

The remainder of this article is structured as follows. Section 2 describes the SCP and some of its applications. In Section 3, the state of the art in hybridization between machine learning and metaheuristics is reviewed, and the main binarization methods are described. Later, in Section 4, the proposed db-scan algorithm is detailed. The contributions of the db-scan operator are presented in Section 5, in which the db-scan technique is also compared with binarization techniques that use k-means and TFs. In Section 6, a real-world application problem is solved. Finally, in Section 7, conclusions and some future lines of research are given.

2. Set Covering Problem

The SCP is one of the oldest and most-studied optimization problems and is well known to be NP-hard [19]. Nevertheless, different solution algorithms have been developed. There exist exact algorithms that generally rely on branch-and-bound and branch-and-cut methods to obtain optimal solutions [20, 21]. These methods, however, struggle with the SCP because the solution effort grows exponentially with the problem size; even medium-sized instances often become intractable and can no longer be solved using exact algorithms. To overcome this issue, different heuristics have been proposed [22, 23].

For example, [22] presented a number of greedy algorithms based on a Lagrangian relaxation (called Lagrangian heuristics). Caprara et al. [24] introduced relaxation-based Lagrangian heuristics applied to the SCP. Metaheuristics, e.g., genetic algorithms [25], simulated annealing [26], and ant colony optimization [27], have also been applied to solve the SCP. More recently, swarm-based metaheuristics, such as the cat swarm [28], cuckoo search [29], artificial bee colony [8], and black hole [30] metaheuristics, have also been proposed.

The SCP has many practical applications in engineering, e.g., vehicle routing, railways, airline crew scheduling, microbial communities, and pattern finding [15, 16, 18, 31].

The SCP can be formally defined as follows. Let $A = (a_{ij})$ be an $m \times n$ zero-one matrix, where a column $j$ covers a row $i$ if $a_{ij} = 1$, and let each column $j$ be associated with a nonnegative real cost $c_j$. Let $I = \{1, \ldots, m\}$ and $J = \{1, \ldots, n\}$ be the row and column sets of $A$, respectively. The SCP consists of searching for a minimum-cost subset $S \subseteq J$ of columns such that every row $i \in I$ is covered by at least one column $j \in S$, i.e.,

$\text{minimize} \; \sum_{j \in J} c_j x_j \quad \text{subject to} \; \sum_{j \in J} a_{ij} x_j \geq 1 \;\; \forall i \in I, \quad x_j \in \{0, 1\},$

where $x_j = 1$ if $j \in S$ and $x_j = 0$ otherwise.
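To make the formulation concrete, the following minimal Python sketch evaluates the feasibility and cost of a candidate subset of columns. The 4 × 5 instance, the costs, and the function names are hypothetical and serve only to illustrate the definition above.

```python
import numpy as np

# Hypothetical 4x5 zero-one coverage matrix A and nonnegative column costs c.
A = np.array([[1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]])
c = np.array([3, 2, 4, 1, 5])

def is_feasible(S):
    # S is feasible if every row is covered by at least one selected column.
    return bool(np.all(A[:, S].sum(axis=1) >= 1))

def cost(S):
    # Objective value: the sum of the costs of the selected columns.
    return int(c[S].sum())

S = [0, 2, 3]                   # candidate solution (0-indexed columns)
print(is_feasible(S), cost(S))  # True 8
```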

3.1. Related Binarization Work

A series of metaheuristic algorithms designed to work in continuous spaces have been developed. Particle swarm optimization (PSO) and cuckoo search (CS) are two of the most commonly used metaheuristic algorithms. On the other hand, the existence of a large number of NP-hard combinatorial problems motivates the investigation of robust mechanisms that allow these continuous algorithms to be adapted to discrete versions.

In a review of state-of-the-art binarization techniques [32], two main approaches were identified. The first approach considers general methods of binarization, in which a mechanism allows the transformation of any continuous metaheuristic into a binary one without altering the metaheuristic operators. In this approach, the main frameworks used are TFs and angle modulation. The second approach corresponds to binarizations in which the way the metaheuristic operates is specifically altered. Under this second approach, notable techniques include quantum binary and set-based approaches.

3.1.1. Transfer Functions

The simplest and most widely used binarization method corresponds to TFs, which were introduced by [33] to generate binary versions of PSO. This algorithm considers each solution as a particle. The particle has a position given by a solution in an iteration and a velocity that corresponds to the vector obtained from the difference of the particle's positions between two consecutive iterations. The TF is a very simple operator that relates the velocity of the particles in PSO to a transition probability: it takes values from $\mathbb{R}$ and generates transition probability values in $[0, 1]$. The TFs force the particles to move in a binary space. Depending on the function's shape, they are usually classified as S-shape [34] and V-shape [1] functions. Once the function produces a value between 0 and 1, the next step is to apply a rule that produces 0 or 1. For this, well-defined rules are applied that use the concepts of complement, elite, and random, among others.
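As an illustration of this mechanism, the sketch below binarizes a velocity vector using the two canonical families: an S-shape sigmoid and a V-shape |tanh|. The selection rule is simplified to a plain random rule; the complement and elite rules mentioned above would replace it in a full implementation, and all names here are illustrative.

```python
import math
import random

def s_shape(v):
    # S-shape TF: maps a velocity in R to a probability in [0, 1].
    return 1.0 / (1.0 + math.exp(-v))

def v_shape(v):
    # V-shape TF: |tanh(v)| also lands in [0, 1).
    return abs(math.tanh(v))

def binarize(velocity, tf=v_shape):
    # Each dimension becomes 1 with the probability given by the TF.
    return [1 if random.random() < tf(v) else 0 for v in velocity]

print(binarize([-2.1, 0.3, 4.0]))           # e.g., [1, 0, 1]
print(binarize([-2.1, 0.3, 4.0], s_shape))  # e.g., [0, 1, 1]
```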

3.1.2. Angle Modulation

This method is based on the family of trigonometric functions shown in equation (3). These functions have four parameters responsible for controlling the frequency and displacement of the trigonometric function:

$g(x) = \sin(2\pi(x - a)\, b \cos(2\pi(x - a)\, c)) + d. \quad (3)$

The first time this method was applied to binarization was in PSO, where binary PSO was applied to benchmark functions. Assume a given binary problem of dimension $n$, and let $x = (x_1, \ldots, x_n)$ be a solution. We start with a four-dimensional search space, where each dimension represents a coefficient of equation (3). Then, every solution $(a, b, c, d)$ is associated with a trigonometric function $g_{(a,b,c,d)}$. For each element $x_j$, the following rule is applied:

$b_j = \begin{cases} 1 & \text{if } g_{(a,b,c,d)}(x_j) \geq 0, \\ 0 & \text{otherwise.} \end{cases} \quad (4)$

Then, for each initial solution of four dimensions $(a, b, c, d)$, the function $g_{(a,b,c,d)}$ shown in equation (3) is applied, and then equation (4) is utilized. As a result, a binary solution of dimension $n$, $(b_1, \ldots, b_n)$, is obtained. This is a feasible solution for our $n$-binary problem. The angle modulation method has been applied to network reconfiguration problems [35] using a binary PSO method, to an antenna position problem using an angle modulation binary bat algorithm [36], and to a multiuser detection technique [37] using a binary adaptive evolutionary algorithm.
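The following sketch illustrates the angle modulation rule under two assumptions that are common in the literature but not confirmed by the text above: the standard four-parameter generating function of equation (3) and evenly spaced sample points j = 0, 1, ..., n − 1.

```python
import math

def g(x, a, b, c, d):
    # Four-parameter angle modulation generating function (equation (3)).
    return math.sin(2 * math.pi * (x - a) * b * math.cos(2 * math.pi * (x - a) * c)) + d

def angle_modulation_binarize(coeffs, n):
    # Map a 4-dimensional continuous solution (a, b, c, d) to an n-bit vector:
    # bit j is 1 when g sampled at point j is nonnegative (equation (4)).
    a, b, c, d = coeffs
    return [1 if g(j, a, b, c, d) >= 0 else 0 for j in range(n)]

print(angle_modulation_binarize((0.0, 0.5, 0.8, -0.2), n=10))
```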

3.1.3. Quantum Binary Approach

There are three main types of algorithms in research that integrates the areas of evolutionary computation (EC) and quantum computation [38]:

(1) Quantum evolutionary algorithms: these methods correspond to the design of EC algorithms to be applied in a quantum computing environment.

(2) Evolutionary-based quantum algorithms: these algorithms attempt to automate the generation of new quantum algorithms using evolutionary algorithms.

(3) Quantum-inspired evolutionary algorithms: this category uses quantum computing concepts to strengthen EC algorithms.

In particular, the quantum binary approach is a type of quantum-inspired evolutionary algorithm. Specifically, this approach adapts the concepts of q-bits and superposition used in quantum computing for application on traditional computers.

In the quantum binary approach, each feasible solution has a position $X = (x_1, \ldots, x_n)$ and a quantum q-bit vector $Q = (q_1, \ldots, q_n)$, where $q_j$ represents the probability of $x_j$ taking the value 1. For each dimension $j$, a random number $r$ in $[0, 1]$ is generated and compared with $q_j$: if $r < q_j$, then $x_j = 1$; otherwise, $x_j = 0$. The upgrade mechanism of the $Q$ vector is specific to each metaheuristic.
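The sampling step can be sketched in a few lines of Python; the q-bit vector below is hypothetical, and the update of Q is omitted because, as noted, it is specific to each metaheuristic.

```python
import random

def sample_binary(q_vector):
    # q_j is the probability that bit j takes the value 1:
    # draw r ~ U[0, 1] and set x_j = 1 when r < q_j.
    return [1 if random.random() < q else 0 for q in q_vector]

Q = [0.9, 0.1, 0.5, 0.7]  # hypothetical q-bit vector
print(sample_binary(Q))   # e.g., [1, 0, 1, 1]
```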

The main difficulty that general binarization frameworks face is related to the concept of spatial disconnect [39]. A spatial disconnect originates when nearby solutions generated by a metaheuristic in the continuous space are not transformed into nearby solutions by the binarization process; roughly speaking, we can think of this as a loss of the continuity of the framework. The spatial disconnect phenomenon alters the exploration and exploitation properties, and therefore the precision and convergence of the metaheuristics decrease. A study of how TFs affect the exploration and exploitation properties was conducted in [40]; a corresponding study for angle modulation was conducted in [39].

On the other hand, specific binarization algorithms that modify the operators of the metaheuristic are susceptible to problems such as Hamming cliffs, loss of precision, search space discretization, and the curse of dimensionality [39]. This was studied by [41] and for the particular case of PSO by [42]. In the latter, the authors observed that the parameters of the Binary PSO change the speed behavior of the original metaheuristic.

3.2. Hybridizing Metaheuristics with Machine Learning

Machine learning concerns algorithms that are capable of learning from a dataset [43]. This learning can be supervised, unsupervised, or semisupervised. Usually, these algorithms are used in problems of regression, classification, transformation, dimensionality reduction, time series, anomaly detection, and computational vision [44], among others. On the other hand, metaheuristics correspond to a broad family of algorithms designed to solve complex problems without the need for a deep adaptation of their mechanism when changing problems. They are incomplete techniques and generally have a set of parameters that must be adjusted for proper operation.

When the state-of-the-art integration between metaheuristic and machine learning algorithms is examined, the integration is seen to be bidirectional: there are studies whereby metaheuristic algorithms contribute to improving the performance of machine learning algorithms, and there are investigations where machine learning algorithms improve the convergence and solution quality of metaheuristic algorithms.

In the case of metaheuristics that improve the performance of machine learning algorithms, such integration is found in all areas. In classification problems, metaheuristics have been used mainly in feature selection, feature extraction, and parameter tuning. In [45], a genetic-SVM algorithm was developed to improve the recognition of breast cancer through image analysis; in this algorithm, genetic algorithms were used for feature extraction. A multiverse optimizer algorithm was used in [46] to perform the feature selection and parameter tuning of SVM on a robust system architecture. The training of feed-forward neural networks was addressed using an improved monarch butterfly algorithm in [47]. In [48], a geotechnical problem was addressed by integrating a firefly algorithm with the least squares support vector machine technique. For regression problems, [49] presents an application to the prediction of the compressive strength of high-performance concrete using metaheuristic-optimized least squares support vector regression. The improved prediction of stock prices was addressed in [50] by integrating metaheuristics and artificial neural networks. Additionally, in [51], a stock price prediction technique was developed using sliding-window metaheuristic-optimized machine learning regression applied to Taiwan's construction companies. In [52], a firefly variant was used to optimize the least squares support vector regression parameters with the aim of improving prediction accuracy in engineering design.

In the case of unsupervised learning techniques, metaheuristics have contributed significantly to clustering techniques. For example, in [53], an evolutionary-based clustering algorithm that combines a metaheuristic with a kernel intuitionistic fuzzy c-means method was proposed with the aim of designing clustering solutions applicable to different types of datasets. Also in clustering, centroid search is a problem type with high algorithmic complexity; it consists of searching for centroids that group the studied objects as well as possible. Because this problem is NP-hard, approximation methods have been proposed. For example, an improved artificial bee colony algorithm was developed in [54] with the goal of solving the energy efficiency clustering problem in a wireless sensor network. In [55], a mathematical model and a clustering search metaheuristic were developed for the helicopter transportation planning of oil and gas production platform employees.

On the other hand, when looking for the contributions of machine learning techniques to metaheuristic algorithms, two main lines of research can be distinguished. The first corresponds to specific integrations, in which machine learning techniques are integrated through a specific operator in one of the modules that constitute the metaheuristic. The second explores general integrations, where the machine learning techniques work as a selector of different metaheuristic algorithms, choosing the most appropriate one for each instance.

A metaheuristic, in addition to its evolution mechanism, usually uses operators for solution initiation, solution perturbation, population management, binarization, parameter tuning, and local search, among others. The specific integrations explore the application of machine learning to some of these operators. In the case of parameter tuning, the tuning of a chess rating system was implemented in [56], and a semiautomatic parameter tuning algorithm based on decision trees and fuzzy logic was designed in [57]. Another relevant area of research is the design of binary versions of algorithms that work naturally in continuous spaces, with the aim of applying these binary versions to combinatorial problems. In this area, [2] applied unsupervised learning techniques to perform the binarization process, [29] explored the percentile concept in the process of generating binary algorithms, and [17] applied the big data Apache Spark framework to manage the size of the solution population to improve convergence times and result quality. Randomness is frequently used to initialize the solutions of a metaheuristic; however, machine learning has also been used to improve the solution initialization stage. In [58], case-based reasoning was used to initialize a genetic algorithm applied to the weighted-circle design problem. Hopfield neural networks were used in [59] to initialize the solutions of a genetic algorithm used to solve an economic dispatch problem.

When analyzing the methods found in the literature addressing general integrations of machine learning algorithms with metaheuristics, we find three main groups: algorithm selection, hyperheuristics, and cooperative strategies. The objective of algorithm selection is, given a set of algorithms and a group of characteristics associated with each problem instance, to choose the algorithm that performs best on similar instances. In the hyperheuristic strategy, the goal is to automate the design of heuristic or metaheuristic methods to address a wide range of problems. Finally, cooperative strategies consist of combining algorithms in a parallel or sequential manner to obtain more robust methods; the cooperation can involve sharing a complete solution or only part of one. In [60], the berth scheduling problem at bulk terminals was addressed by algorithm selection techniques. The nurse rostering problem was addressed through a tensor-based hyperheuristic algorithm in [61]. Finally, a distributed agent-based framework was proposed in [62], in which each agent corresponds to a metaheuristic that can adapt through direct cooperation; this framework was applied to the permutation flow shop problem.

4. Binary Db-Scan Algorithm

The binary db-scan algorithm is composed of five operators. The first operator, detailed in Section 4.1, initializes the solutions. After the solutions are initialized, the next step is to verify whether the maximum iteration criterion is met. While the criterion has not been met, the binary db-scan operator is executed. This operator executes the continuous metaheuristic and then clusters the solutions considering the absolute values of their velocities. The details of this operator are described in Section 4.2. Subsequently, using the clusters generated by the db-scan operator, the transition operator binarizes the solutions generated by the continuous metaheuristic. When points are identified by db-scan as outliers, a transition operator for outliers is applied. The transition operator and the outlier operator are described in Section 4.3. Finally, when the obtained solutions do not satisfy all the restrictions, the repair operator described in Section 4.4 is applied. Additionally, a heuristic operator is used in the initiation and repair of the solutions. This operator is detailed in Section 4.5. The flow diagram of the binary db-scan algorithm is shown in Figure 1.

4.1. Initiation Operator

This operator initializes the solutions that the binary db-scan algorithm will use. In the first step, the SelectRandomColumn() function selects a column randomly. Then, the operator checks whether the row coverage constraint is fulfilled. When the constraint is not met, the initiation operator calls the heuristic operator. This heuristic operator receives the list of columns that the solution currently contains and returns a new column to be incorporated. The details of the heuristic operator are described in Section 4.5. The heuristic operator is called until all rows are covered. The procedure for initiating the solutions is shown in Algorithm 1.

(1) Function Initiation()
(2) Input: –
(3) Output: solution S
(4) S ← SelectRandomColumn()
(5) while all rows are not covered do
(6)  S.append(Heuristic(S))
(7) end while
(8) return S
4.2. Binary Db-Scan Operator

The goal of the binary db-scan operator is to group the different solutions obtained by the execution of the continuous metaheuristic. When considering solutions as particles, we understand the position of the particle as the location of the solution in the search space. On the other hand, the velocity represents the transition vector of the particle from iteration $t$ to iteration $t+1$.

To perform the clustering, the density-based spatial clustering of applications with noise (db-scan) algorithm is used. Db-scan is a data clustering algorithm proposed in 1996 by [63]. Db-scan uses the concept of density to perform the clustering: given a set S of points in a metric space, db-scan groups together points with many nearby neighbors, marking as outliers those that lie alone in low-density regions. Db-scan requires two parameters: a radius ϵ and a minimum number of neighbors δ. The db-scan algorithm can be divided into the following steps:

(i) Find the points in the ϵ neighborhood of every point and identify the core points, i.e., those with more than δ neighbors.

(ii) Find the connected components of core points on the neighbor graph, ignoring all noncore points.

(iii) Assign each noncore point to a nearby cluster if the cluster contains an ϵ neighbor of the point; otherwise, assign it to noise.
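In practice, these steps are implemented by standard libraries. The sketch below uses scikit-learn's DBSCAN on one-dimensional toy data to show how the two parameters map onto the implementation: eps plays the role of ϵ and min_samples that of δ (note that scikit-learn counts the point itself toward min_samples).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: absolute velocity values, one per dimension, reshaped to a column.
points = np.array([[0.02], [0.03], [0.05], [0.40], [0.42],
                   [0.45], [0.90], [0.91], [0.95], [3.00]])

clustering = DBSCAN(eps=0.1, min_samples=2).fit(points)

# Labels are consecutive cluster ids; -1 marks points classified as noise.
print(clustering.labels_)  # [0 0 0 1 1 1 2 2 2 -1]
```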

Let $MH$ be a continuous swarm intelligence metaheuristic, and let $lP(t)$ be the list of positions of the solutions given by $MH$ at iteration $t$. The binary db-scan operator has input parameters $\epsilon$ and $\delta$ and aims to cluster the solutions given by $MH$. The first step of the operator is to iterate the list $lP(t)$ using $MH$ to obtain the list of positions $lP(t+1)$ at iteration $t+1$. Subsequently, using $lP(t)$ and $lP(t+1)$, we obtain the list $lV(t)$ of transition velocities.

Let $v^p(t)$ be the velocity vector in the transition between $t$ and $t+1$ corresponding to particle $p$. This vector has $n$ dimensions, where $n$ depends on the number of columns that the problem possesses. Let $v_i^p(t)$ be the value for dimension $i$ of the vector $v^p(t)$. Then, $lA(t)$ corresponds to the list of absolute values $|v_i^p(t)|$. Next, we apply db-scan to the list $lA(t)$, thereby obtaining the number of clusters $T$ and the cluster $Id$ to which each $|v_i^p(t)|$ belongs, where $Id \in \{0, \ldots, T-1\}$ or $Id = -1$ for outliers. The procedure for the binary db-scan operator is shown in Algorithm 2.

(1) Function BinaryDbscan(lP(t))
(2) Input: lP(t), MH
(3) Output: lP(t+1), lCluster
(4) lP(t+1) ← applyMh(lP(t))
(5) lV(t) ← getVelocityList(lP(t), lP(t+1))
(6) lA(t) ← getClusterList(lV(t))
(7) lCluster ← applyDbscan(lA(t))
(8) return lP(t+1), lCluster
4.3. Transition Operator

The db-scan operator returns the number of clusters $T$ and a list with the cluster identifier $Id$ to which each element belongs. The purpose of the transition operator is to binarize the solutions generated by $MH$ and clustered by the binary db-scan operator. To perform the binarization, we must consider that the identifiers of the clusters are assigned in the following manner: a value of 0 is assigned to the cluster whose elements have the lowest absolute values. More precisely, let $|v_j|$ and $|v_i|$ be elements of clusters $J$ and $I$, respectively, with $|v_j| < |v_i|$; then, $Id(J) < Id(I)$. The values of $Id$ are consecutive integers starting from 0. Finally, the cases identified by db-scan as outliers receive $Id = -1$. Then, each cluster is assigned a transition probability given by equation (5), where $\alpha$ corresponds to the initial transition coefficient and $\beta$ corresponds to the transition probability coefficient:

$P_{tr}(Id) = \alpha + \beta \cdot \frac{Id}{T}, \quad (5)$

where $T$ is the total number of clusters, not considering outliers.

Finally, to execute the binarization process, consider $x^p(t)$ as the position of particle $p$ in iteration $t$. Let $x_i^p(t)$ be the value of dimension $i$ for the particle $x^p(t)$, and let $v_i^p(t)$ be the velocity of the particle in dimension $i$ in the transition from iteration $t$ to iteration $t+1$. Additionally, consider that $|v_i^p(t)| \in J$, where $J$ is one of the clusters identified by the binary db-scan operator. Then, we use equation (6) to generate the binary position of the particle in iteration $t+1$:

$x_i^p(t+1) = \begin{cases} \hat{x}_i^p(t) & \text{if } rand < P_{tr}(J), \\ x_i^p(t) & \text{otherwise,} \end{cases} \quad (6)$

where $\hat{x}_i^p(t)$ denotes the complement of $x_i^p(t)$ and $rand$ is a uniform random number in $[0, 1]$.

When $Id = -1$, the procedure is as follows: the outlier elements are ordered according to the fitness of their solutions, starting with the solution with the best fitness and proceeding to those with the worst performance. For the top 20% of solutions in terms of fitness, a transition value of $\alpha$ is applied; for the remaining elements, a higher transition value is applied. Finally, once the transition operator is applied, the repair operator described in Section 4.4 is used for solutions that do not satisfy some of the restrictions. The details of the transition operator are shown in Algorithm 3.

(1) Function Transition(lP(t+1), lCluster, T)
(2) Input: lP(t+1), lCluster, T
(3) Output: lB(t+1)
(4) for p, Id in (lP(t+1), lCluster) do
(5)  if Id not in outliers then
(6)   Ptr ← getTransitionProbability(Id, T) –equation (5)
(7)  else
(8)   Ptr ← getOutlierTransitionProbability(p, lP(t+1))
(9)  end if
(10) lB(t+1).append(getBinaryPosition(p, Ptr)) –equation (6)
(11) end for
(12) for S in lB(t+1) do
(13)  S ← Repair(S)
(14) end for
(15) return lB(t+1)
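A compact Python sketch of the transition step is given below. It assumes the linear probability of equation (5) as reconstructed above, the complement rule of equation (6), and illustrative values of α and β; outlier dimensions (label −1) are skipped because they are handled by the separate outlier rule.

```python
import random

def transition_probability(cluster_id, n_clusters, alpha=0.1, beta=0.5):
    # Equation (5): clusters with larger ids (larger |velocity|) transition more often.
    return alpha + beta * cluster_id / n_clusters

def apply_transition(binary_position, labels, n_clusters, alpha=0.1, beta=0.5):
    # Flip each bit with the probability of its velocity cluster (equation (6)).
    new_position = list(binary_position)
    for i, cid in enumerate(labels):
        if cid == -1:
            continue  # outlier: delegated to the outlier transition operator
        if random.random() < transition_probability(cid, n_clusters, alpha, beta):
            new_position[i] = 1 - new_position[i]  # complement rule
    return new_position

print(apply_transition([0, 1, 1, 0], labels=[0, 0, 1, -1], n_clusters=2))
```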
4.4. Repair Operator

The repair operator is executed after the execution of the transition operator. In the event that the coverage condition of the rows is not met, the repair operator uses the heuristic operator to add new columns. After all the rows are covered, we verify that there are no groups of columns that cover the same rows. The details of the repair operator are shown in Algorithm 4.

(1) Function Repair(S)
(2) Input: solution S
(3) Output: the repaired solution S
(4) while needRepair(S) == True do
(5)  S.append(Heuristic(S))
(6) end while
(7) S ← deleteRepeatedItem(S)
(8) return S
4.5. Heuristic Operator

When a solution needs to be initialized or repaired, a heuristic operator is used to select a new column. As an input parameter, the operator considers the solution $S_{in}$ that needs to be completed. In the case of a new solution, $S_{in} = \emptyset$. Given the list of columns belonging to $S_{in}$, we obtain the set of rows $R$ not covered by the solution. With the set of uncovered rows and using equation (7), we obtain in line 4 the best 10 rows to be covered. With this list of rows (lRows), in line 5, we obtain the list of the best columns according to the heuristic shown in equation (8). Finally, in line 6, we randomly obtain the column to incorporate. The details of the heuristic operator are shown in Algorithm 5. In equation (7), $L_i$ is the sum of all ones in row $i$; in equation (8), $M_j$ is the set of rows covered by column $Col_j$.

(1) Function Heuristic(S_in)
(2) Input: input solution S_in
(3) Output: the new column col
(4) lRows ← getBestRows(R, N = 10)
(5) lCols ← getBestColumns(lRows, M = 5)
(6) col ← getColumn(lCols)
(7) return col
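One way to realize Algorithm 5 in Python is sketched below. The concrete ranking formulas are assumptions consistent with the where-clauses of equations (7) and (8): rows with the fewest covering columns (smallest L_i) are prioritized, and candidate columns are ranked by cost per newly covered row (M_j restricted to the selected rows). Appending the returned column and repeating until all rows are covered reproduces the loops of Algorithms 1 and 4.

```python
import random
import numpy as np

def heuristic(A, c, S, n_rows=10, n_cols=5):
    # Return one new column for the partial solution S; the caller must
    # ensure that at least one row is still uncovered.
    covered = (A[:, S].sum(axis=1) >= 1) if S else np.zeros(A.shape[0], dtype=bool)
    uncovered = np.where(~covered)[0]
    # Best rows: the uncovered rows with the smallest L_i (hardest to cover).
    L = A[uncovered].sum(axis=1)
    best_rows = uncovered[np.argsort(L)[:n_rows]]
    # Candidate columns: those covering at least one best row,
    # ranked by cost divided by the number of best rows they cover.
    cover_counts = A[best_rows].sum(axis=0)
    candidates = np.where(cover_counts > 0)[0]
    scores = c[candidates] / cover_counts[candidates]
    best_cols = candidates[np.argsort(scores)[:n_cols]]
    return int(random.choice(best_cols))  # random pick among the best columns
```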

5. Results

To determine the contribution of the db-scan algorithm to the binarization process, three groups of experiments are performed. The first group compares the db-scan algorithm with two random operators, as detailed in Section 5.2. The second group considers comparing db-scan with the k-means clustering technique. The results are shown in Section 5.3, and the details of the k-means technique can be found in [1]. Finally, the third group is shown in Section 5.4 and compares the binarization performed by db-scan with the binarization using TFs. The latter is a technique widely used in the binarization of continuous algorithms. Additionally, in Section 5.1, we describe the methodology used to define the parameters of the utilized algorithms.

CS [5] and PSO [7] were the selected algorithms, for three reasons. First, both algorithms are quite simple to parameterize; thus, the study can focus on the binarization technique rather than on parameterization. Second, both algorithms have satisfactorily solved nonlinear optimization problems [17, 32, 64–66]. Finally, simplified theoretical convergence models have been developed for both PSO [39] and CS [67].

For the evaluation of the db-scan algorithm, instances E, F, G, and H, which correspond to the most difficult instances from Beasley’s OR library, were used. For the execution of the instances, we used a PC with Windows 10 and an Intel Core i7-8550U processor with 16 GB of RAM. The algorithm was programmed in Python 3.7. To perform the statistical analysis in this study, the nonparametric Wilcoxon signed-rank test and violin charts were used. The analysis was performed by comparing the dispersion, median, and interquartile ranges of the distributions.

5.1. Parameter Settings

To obtain the parameters necessary to generate the binary algorithms db-scan-PSO and db-scan-CS, the methodology proposed in [1, 2] was selected. This methodology uses four measures, defined in equations (9) to (12), to determine the best configuration. To compare the different configurations, the four measures are placed on a radar chart, and the enclosed area is calculated for each configuration; the configuration with the largest area is selected. The measures are as follows (a plausible closed form is sketched after this list):

(1) The percentage deviation of the best value obtained in the ten executions compared with the best known value (equation (9))

(2) The percentage deviation of the worst value obtained in the ten executions compared with the best known value (equation (10))

(3) The percentage deviation of the average value obtained in the ten executions compared with the best known value (equation (11))

(4) The convergence time for the best value in each experiment, normalized according to equation (12)
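Assuming that the deviations are measured relative to the best known value and that times are min-max normalized, which matches the descriptions above but is a reconstruction rather than the published formulas, equations (9) to (12) would take the form

$dev_{best} = 100 \cdot \frac{Best - KnownBest}{KnownBest}, \quad dev_{worst} = 100 \cdot \frac{Worst - KnownBest}{KnownBest},$

$dev_{avg} = 100 \cdot \frac{Avg - KnownBest}{KnownBest}, \quad t_{norm} = \frac{t - t_{min}}{t_{max} - t_{min}}.$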

For PSO, the coefficients $c_1$ and $c_2$ are set to 2, and $\omega$ is linearly decreased from 0.9 to 0.4. For the parameters used by db-scan, the minimum number of neighbors ($\delta$) is estimated as a percentage of the number of particles ($N$); for example, if $N = 50$ and the percentage is 10%, then the minimum number of neighbors for a point to be considered within a cluster is 5. To select the parameters, problems E.1, F.1, G.1, and H.1 were chosen. The parameter settings are shown in Tables 1 and 2. In both tables, the column labeled Value represents the selected value, and the column labeled Range corresponds to the set of scanned values.

5.2. Contribution of Db-Scan Binary Operator

This section attempts to determine the contribution of the db-scan operator by comparing it with two random operators. The random operators model the situation whereby the transition probability does not depend on the velocity of the particle, unlike the db-scan operator, where the velocity strongly influences the cluster in which a dimension will be located. Two random operators were considered. In the first case (naive), each point has the same probability of transition, independent of the velocity. Two conditions were considered for this operator: the operator random-0.25 has a fixed transition probability of 0.25, and the operator random-0.5 uses a fixed transition probability of 0.5. To make the comparison, CS was used. The second random operator, random-cluster-5, additionally includes the concept of clusters. In this second operator, 5 clusters are defined, where 5 corresponds to the average number of clusters obtained by db-scan when executing the different instances. Each cluster is then assigned a transition probability from a fixed set, without repetition, and each particle dimension is randomly assigned to a cluster. A sketch of both operators is given below.
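For reference, both random operators can be sketched as follows; the probability set used by the cluster-based operator is hypothetical, since only its group-wise character matters for the comparison.

```python
import random

def random_transition(binary_position, p=0.5):
    # Naive operator: every bit flips with the same fixed probability p,
    # ignoring the particle's velocity entirely.
    return [1 - x if random.random() < p else x for x in binary_position]

def random_cluster_transition(binary_position, n_clusters=5,
                              probabilities=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # Cluster-based operator: each cluster carries a distinct transition
    # probability, and each bit is assigned to a random cluster.
    cluster_p = random.sample(probabilities, n_clusters)  # no repetitions
    return [1 - x if random.random() < cluster_p[random.randrange(n_clusters)] else x
            for x in binary_position]

print(random_transition([0, 1, 0, 1], p=0.25))
print(random_cluster_transition([0, 1, 0, 1]))
```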

The results obtained using the naive random operators are shown in Table 3 and Figure 2. When we analyze the best and average indicators shown in the table, the superiority of the results obtained by db-scan over those obtained by the random-0.5 and random-0.25 operators is observed. This difference is consistent across all instances, and the Wilcoxon test indicates that it is significant. When analyzing the violin charts, we see that the dispersion, interquartile range, and median are substantially more robust when using the db-scan operator. This experiment is a strong indicator that, in the binarization process, i.e., the assignment of a transition probability to a particle, it is critical to consider the behavior of the particle in the search space, as doing so yields better performing methods.

For the random-cluster-5 operator, the results are shown in Table 4 and Figure 3. In this experiment, the PSO and CS algorithms were used. When we analyze the results in the table, it is observed that db-scan behaves better than random-cluster-5 for both algorithms. When analyzing Figure 3, it is observed that the median, interquartile range, and dispersion measures obtain better results with the db-scan operator. Additionally, we should note that random-cluster-5 achieves better performance than the naive random operators, which suggests that assigning random transition probabilities by groups is more appropriate than assigning them individually.

5.3. K-Means Algorithm Comparison

K-means is another clustering technique, which was used in [2] to binarize continuous swarm intelligence algorithms applied to the knapsack problem. The objective of this section is to compare the behavior of the binarization based on db-scan with that based on k-means. Unlike db-scan, the k-means technique requires the number of clusters to be defined in advance. Regarding computational cost, once the number of clusters ($k$) and the dimension ($d$) of the points are fixed, k-means has a complexity of $O(nkd)$ per iteration, where $n$ is the number of points to be clustered, whereas db-scan has a complexity of $O(n \log n)$. In this experiment, the quality of the solutions and their execution times are compared. For k-means, $k = 5$; for db-scan, the number of clusters is variable. For comparison, the same dataset as in the previous experiment is used. In Table 5, the results of the binarization for CS and PSO are shown using the k-means and db-scan operators. When we observe the best and average indicators, we see that their values are very similar for both the implementation with k-means and the implementation with db-scan. Moreover, when we use the Wilcoxon test, we see that the small differences are not significant. However, when we analyze the execution times, we see that db-scan improves on the times obtained by k-means. When we compare the interquartile range and the dispersion shown in Figure 4, we see that the results are very similar. Thus, although k-means handles a fixed number of clusters and db-scan a variable number, the quality of the solutions is not affected significantly.

5.4. Transfer Function Comparison

In this section, we detail the experiments that evaluate the behavior of binarization using db-scan with respect to TFs. The TF is a general binarization mechanism that, instead of using the cluster concept to assign a transition probability, uses functions that map $\mathbb{R}$ into the space $[0, 1]$. Usually, two families of functions are used: v-shape functions (e.g., $|\tanh(x)|$) and s-shape functions (e.g., $1/(1 + e^{-x})$). For more details about TFs, we recommend [32].

In our case, we used a v-shape function with parameter $\tau$ for both the CS and PSO algorithms. The methodology used to determine the family and the parameter $\tau$ is the same as that detailed in Section 5.1. The results are shown in Table 6 and Figure 5. For the TFs, 2000 iterations were considered in the experiment. From Table 6, when analyzing the best and average indicators, it is observed that the binary algorithms obtained through db-scan achieve better performance than those obtained with the TF, and the Wilcoxon test shows that the differences are significant. When we look at Figure 5, we see that the dispersion and interquartile ranges are considerably improved when using db-scan. We should note that the TF assigns a particular transition probability value to each solution based on a function, unlike db-scan, which assigns probabilities to groups of solutions.

6. Real-World Application

The crew scheduling problem (CSP) is related to building the work schedules of crews necessary to cover a planned timetable. The CSP is studied in operations research and is usually related to the airline industry, transit companies, and railways, among others. In this section, we are interested in using the binarizations obtained from applying the db-scan algorithm to the CSP.

The CSP, due to its difficulty, needs to be decomposed into several stages, where each stage has a given computational complexity. The literature contains variations of the CSP that consider integration with other problems or the inclusion of new restrictions. For example, in the CSP in [68], attendance rates were studied. A CSP integration with the vehicle scheduling problem was developed in [69]. In [70], an application of the CSP with fairness preferences was explored. The crew pairing and fleet assignment problems were studied in [71].

The CSP starts with a timetable of services that must be executed with a certain frequency, where each service needs to be executed in a certain time window. A service consists of a sequence of trips, where a trip has the following attributes: a start time, an end time, a departure station, an arrival station, and a crew that delivers the service. In terms of the above attributes, each trip is assigned a cost. When we consider a period of time and a crew, a roster must be generated. The CSP then consists of finding a subset of rosters that covers all trips at minimum cost. The problem can be divided into two phases. The first phase corresponds to the generation of pairings, where a pairing is defined as a set of trips that are assigned to a single crew in a short period of time. In this pairing phase, a large number of pairings satisfying the constraints of the problem is generated. A pairing must start and end at the same depot and has an associated cost. The second phase corresponds to the pairing optimization. At this stage, the best subset of all generated pairings is selected to ensure that all trips are covered at minimum cost. The modeling of this phase follows an approach based on the solution of set covering or set partitioning problems. In this work, we use a dataset for which the pairings were already generated; therefore, we focus our efforts on the pairing optimization phase. To verify our algorithm, 7 datasets associated with real-world crew scheduling problems were used. These datasets come from an application of the Italian railways and were provided by Ceria et al. [72]. Table 7 shows the datasets and their results. When we analyze the table, we observe that although these problems are larger than the previous ones, the performances of the db-scan-CS and db-scan-PSO binarizations were adequate: in both cases, the gap for the best value was 0.52%, while the average gaps were 1.17% and 1.08%.

7. Conclusions

In this article, an algorithm was proposed that uses the db-scan technique to binarize continuous swarm intelligence metaheuristics. To evaluate the proposed algorithm, as a first step, two random operators were designed with the objective of identifying the contribution of db-scan to the binarization process. Subsequently, the proposed db-scan algorithm was compared with two binarization techniques. The first technique is based on the clustering concept and uses the k-means technique, where the number of clusters is fixed. The second technique uses TFs as a binarization mechanism.

In the comparison with the binarization technique that uses the concept of k-means, the results were very similar. These results were confirmed with the Wilcoxon test, which showed no significant differences between the two techniques. However, we must emphasize that the execution times of db-scan were shorter than those of k-means. One point to consider is that the different methods of generating the clusters do not affect the quality of the solutions: in the case of k-means, a fixed number of clusters is defined and generated based on the proximity of the points, whereas for db-scan, the number of clusters is variable and is generated based on the proximity and density of the points. In comparison with the TFs, we observed that there is a significant difference in favor of db-scan. This suggests that, in the binarization process, it is more efficient to assign transition probabilities to groups than to assign them individually.

The application of machine learning to metaheuristic algorithms is a line of research with several facets. Machine learning techniques can learn and help to understand under which conditions a metaheuristic algorithm performs efficiently. Moreover, these techniques can be applied to other operators, such as perturbation operators used when a metaheuristic algorithm is trapped in a local optimum, or operators that control the population of a swarm intelligence algorithm to improve its intensification and diversification properties. Additionally, appealing to the no-free-lunch theorem, it would be interesting to evaluate these algorithms, including machine learning tuning, when applied to other combinatorial problems.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

José García was supported by the grant CONICYT/FONDECYT/INICIACION/11180056, Broderick Crawford was supported by the grant CONICYT/FONDECYT/REGULAR/1171243, and Ricardo Soto was supported by the grant CONICYT/FONDECYT/REGULAR/1190129.