Introduction

Multiple Instance Learning (MIL) is a form of weak supervision. It falls under the inexact supervision category, in which data are given labels that are not as precise as desired. Such data are prevalent in the medical field, where class labels are often unavailable at the desired granularity [1]. Hence, MIL is particularly well suited for medical data analysis [2].

MILBoost was first proposed by Viola et al. [3], mainly for object detection in images and videos. Since then, MILBoost and many of its variants have been applied to a range of tasks: human action recognition [4], MIL with gradient boosting for face recognition from videos [5], human detection from artificially generated 3D human models [6], multi-class MILBoost for human parts detection [7], logistic MILBoost for pedestrian detection [8], gentle MILBoost for human detection, which uses a Newton update to obtain an optimal weak classifier [9], confidence-rated MILBoost [10], online MILBoost [11], object tracking by incorporating instance significance estimation into online MILBoost [12], and online MILBoost for visual object tracking [13,14,15]. In medical applications, MILBoost has been used for early temporal prediction of Type 2 diabetes risk [16], liver cirrhosis classification from ultrasound images [17], and histopathology cancer image classification, segmentation and clustering [18,19,20].

The main concept behind boosting is to sequentially train several weak classifiers (weak estimators) and combine them into a strong classifier. The combination is a weighted sum of the weak classifiers, with each weak classifier assigned a weight. The main task is to find the combination of optimized weights that yields the strongest classifier. MILBoost uses the AnyBoost framework [21], in which the boosting classifier is trained by maximizing the log-likelihood over all bags. There is scope to improve the MILBoost framework by replacing the single-point gradient descent used for weight optimization with a population-based evolutionary technique, which also opens up the possibility of parallelizing the optimization process. Evolutionary algorithms such as the Genetic Algorithm (GA) [22] and Differential Evolution (DE) [23] have previously been used in MIL to formulate pooling functions [24, 25].

The main objective of this work is to formulate a MILBoost framework based on differential evolution (DE), which makes the optimization process amenable to parallelization.

The rest of the paper is divided into six sections. Section 2 elaborates on MILBoost. Section 3 gives a brief description of DE. Section 4 presents the methodology. Section 5 discusses the experiments done and the subsequent results are discussed in Sect. 6. Finally, Sect. 7 concludes the paper.

Multiple instance boosting (MILBoost)

This section presents the formal representation of MILBoost. Suppose we have binary classification data \((X_{1} ,Y_{1} ),(X_{2} ,Y_{2} ),...,(X_{n} ,Y_{n} )\) where \(X_{i} = \{ x_{i1} ,x_{i2} ,...,x_{{{\text{im}}}} \}\), i ∈ {1,2,…,n}, n is the number of bags, m is the number of instances in bag Xi and Yi ∈ {0,1}. Yi = 1 indicates that the positive bag Xi contains at least one positive instance xij, j ∈ {1,2,…,m}. Yi = 0 means that there are no positive instances in the bag Xi. The task is to identify a real-valued function h(xij) to infer the instance label yij corresponding to an instance xij. This function is estimated through a weak classifier. Then, through boosting, the weak classifiers are combined to form a strong classifier with low error

$$ H = \sum\limits_{k = 1}^{K} {\alpha_{k} y_{ij}^{k} = } \sum\limits_{k = 1}^{K} {\alpha_{k} h_{k} (x_{ij} )} $$
(1)

where K is the number of weak classifiers and αk are the classifier (estimator) weights, which signify the relative importance of each weak classifier. In each boosting round, incorrectly classified instances receive higher weights.

In MILBoost, the probability of an instance being positive is

$$ p_{ij} = \frac{1}{{1 + \exp ( - y_{ij} )}} $$
(2)

The probability that a bag is positive is

$$ p_{i} = 1 - \prod\limits_{j = 1}^{m} {(1 - p_{ij} )} $$
(3)

The log-likelihood of all bags is

$$ L = \sum\limits_{i = 1}^{n} {(y_{i} \log (p_{i} ) + (1 - y_{i} )\log (1 - p_{i} ))} $$
(4)

The main task is to train the classifier by maximizing this log-likelihood function.
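To make Eqs. (2)–(4) concrete, the following minimal NumPy sketch computes the noisy-OR bag probabilities and the bag log-likelihood. The function name, the bag_index mapping and the eps clipping are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def bag_log_likelihood(instance_scores, bag_labels, bag_index):
    """Log-likelihood of all bags (Eqs. 2-4) under noisy-OR pooling.

    instance_scores : real-valued scores y_ij for every instance
    bag_labels      : bag labels Y_i in {0, 1}
    bag_index       : array mapping each instance to its bag index i
    """
    p_ij = 1.0 / (1.0 + np.exp(-instance_scores))   # Eq. (2)
    eps = 1e-12                                     # numerical guard (assumption)
    L = 0.0
    for i, Y_i in enumerate(bag_labels):
        in_bag = (bag_index == i)
        p_i = 1.0 - np.prod(1.0 - p_ij[in_bag])     # Eq. (3), noisy-OR over the bag
        p_i = np.clip(p_i, eps, 1.0 - eps)
        L += Y_i * np.log(p_i) + (1.0 - Y_i) * np.log(1.0 - p_i)   # Eq. (4)
    return L
```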

Differential evolution (DE)

Differential evolution (DE) is a population-based evolutionary metaheuristic technique used for solving complex, structured optimization problems in many application areas. DE was initially proposed by Storn and Price [23] in 1996; for a more thorough treatment, the reader may refer to [26]. In general, a DE run is divided into two phases: initialization and evolution. The initialization phase comprises random population generation, and the evolution phase consists of mutation, crossover and selection, which generate the new population for the next generation. The flowchart for DE is presented in Fig. 1.

Fig. 1

Differential Evolution Flowchart

Operations in DE

Initialization

In this step, a uniformly distributed random population is generated. These individuals represent the initial solution points in the search space.

$$ X_{G} = (X_{1} ,X_{2} ,...,X_{NP} ) $$
(5)
$$ X_{i} = (x_{1i} ,x_{2i} ,...,x_{Di} ) $$
(6)
$$ x_{ji} = lb + r_{ji} *(ub - lb) $$
(7)

where G is the generation index, NP is the number of individuals in the population, D is the dimension of an individual, lb and ub are the lower and upper bounds respectively, rji ∈ [0,1] is a uniformly distributed random number, i ∈ {1,2,…,NP} and j ∈ {1,2,…,D}.
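A minimal Python sketch of the initialization step of Eqs. (5)–(7); the bounds, population size, dimension and seed are illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

NP, D = 20, 10                   # population size and individual dimension (example values)
lb, ub = -1.0, 1.0               # lower and upper bounds (example values)

# Eq. (7): each component is a uniform sample scaled into [lb, ub]
population = lb + rng.random((NP, D)) * (ub - lb)
```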

Mutation

After population generation, mutation is performed to expand the search space. In the mutation step, a corresponding mutant vector is generated for each target vector. DE has various mutation strategies; in this paper, the DE/rand/1 strategy is used to generate the mutant vector \(V_{i} = (v_{1i} ,v_{2i} ,...,v_{Di} )\)

$$ V_{i} = X_{{r_{1} }} + F*(X_{{r_{2} }} - X_{{r_{3} }} ) $$
(8)

where Vi is the mutant vector, F ∈ (0,1.2] is the scaling factor, Xr1, Xr2, Xr3 are randomly selected individuals from the population and r1, r2, r3 ∈ {1,2,…,NP} with r1 ≠ r2 ≠ r3 ≠ i.
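Continuing the sketch above, a possible implementation of DE/rand/1 mutation (Eq. 8); the scaling factor shown is an example value.

```python
# DE/rand/1 mutation (Eq. 8): V_i = X_r1 + F * (X_r2 - X_r3),
# with r1, r2, r3 mutually distinct and different from i.
F = 0.8   # scaling factor, illustrative value

def mutate(population, i, F, rng):
    NP = len(population)
    candidates = [k for k in range(NP) if k != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    return population[r1] + F * (population[r2] - population[r3])
```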

Crossover

Crossover is performed between the target vector and the mutant vector to increase the diversity of the population and to assimilate the best individual. After crossover, trial vectors are obtained. The trial vector \(U_{i} = (u_{1i} ,u_{2i} ,...,u_{Di} )\) is formed as

$$ u_{ji} = \begin{cases} v_{ji} , & {\text{if}}\;r_{ji} \le CR\;{\text{or}}\;j = j_{r} \\ x_{ji} , & {\text{otherwise}} \end{cases} $$
(9)

where CR ∈ [0,1] is the crossover probability, rji ∈ [0,1] is a uniform random number and jr ∈ {1,2,…,D} is a randomly chosen index that guarantees at least one component of the trial vector comes from the mutant vector.
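A corresponding sketch of binomial crossover (Eq. 9), again with an illustrative crossover probability.

```python
# Binomial crossover (Eq. 9): take the mutant component when a uniform draw
# falls below CR, or at the forced index j_r; otherwise keep the target component.
CR = 0.7   # crossover probability, illustrative value

def crossover(target, mutant, CR, rng):
    D = len(target)
    j_r = rng.integers(D)            # forced index
    mask = (rng.random(D) <= CR)
    mask[j_r] = True                 # guarantee at least one gene from the mutant
    return np.where(mask, mutant, target)
```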

Selection

A one-to-one tournament selection is performed between the trial vector and the target vector, and the one having the better fitness value moves on to the next generation.

$$ X_{i,G + 1} = \begin{cases} U_{i,G} , & {\text{if}}\;f(U_{i,G} ) \le f(X_{i,G} ) \\ X_{i,G} , & {\text{otherwise}} \end{cases} $$
(10)

where f() is the objective function to be minimized.
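Putting the pieces together, one possible DE/rand/1/bin loop that reuses the mutate and crossover helpers sketched above to minimize an objective f; this is a sketch under the stated assumptions, not the authors' implementation.

```python
def de_minimize(f, lb, ub, D, NP=20, F=0.8, CR=0.7, generations=100, seed=0):
    """Minimal DE/rand/1/bin loop (Eqs. 5-10) minimizing f."""
    rng = np.random.default_rng(seed)
    pop = lb + rng.random((NP, D)) * (ub - lb)            # initialization, Eq. (7)
    fitness = np.array([f(x) for x in pop])
    for _ in range(generations):
        for i in range(NP):
            trial = crossover(pop[i], mutate(pop, i, F, rng), CR, rng)   # Eqs. (8)-(9)
            f_trial = f(trial)
            if f_trial <= fitness[i]:                     # selection, Eq. (10)
                pop[i], fitness[i] = trial, f_trial
    best = np.argmin(fitness)
    return pop[best], fitness[best]
```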

Methodology

DE is used in the MILBoost framework to optimize the log-likelihood over all bags defined in Eq. (4); hence, the objective function for DE in this work is the log-likelihood function. A population of the classifier weights αk defined in Eq. (1) is randomly initialized. The algorithm for the proposed Evolutionary MILBoost (EMILBoost) is presented below, and it is represented pictorially by the flowchart in Fig. 2.

Algorithm: Evolutionary MILBoost (EMILBoost)

Fig. 2

Evolutionary multiple instance boosting flowchart

As mentioned earlier, DE paves the way for parallelizing the optimization process. Unlike single-point optimization techniques, DE, being a population-based metaheuristic, approaches the optimum from multiple directions: it generates multiple candidate solutions in the search space and then converges towards the optimal point. Hence, rather than proceeding from a single point, DE explores the problem from several directions, and the independent evaluation of the population members lends itself naturally to parallelization.
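As an illustration of how the DE loop above could be wired into the boosting step, the sketch below trains K decision stumps and lets DE choose their weights αk of Eq. (1) by minimizing the negative bag log-likelihood of Eq. (4). The bootstrap resampling, the use of inherited bag labels as instance-level surrogate targets and all names here are simplifying assumptions for illustration, not the authors' exact EMILBoost algorithm.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_emilboost_weights(X_inst, y_bag_of_inst, bag_labels, bag_index, K=10, seed=0):
    """Illustrative sketch: train K decision stumps on bootstrap samples of the
    instances (each instance inherits its bag label as a surrogate target),
    then let DE choose the stump weights alpha_k."""
    rng = np.random.default_rng(seed)
    n = len(X_inst)
    stumps, scores = [], []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_inst[idx], y_bag_of_inst[idx])
        stumps.append(stump)
        scores.append(stump.predict_proba(X_inst)[:, 1] - 0.5)   # signed instance scores y_ij^k
    scores = np.array(scores)                            # shape (K, n_instances)

    def neg_log_likelihood(alpha):
        inst_scores = alpha @ scores                     # Eq. (1): weighted sum of weak outputs
        return -bag_log_likelihood(inst_scores, bag_labels, bag_index)   # maximize Eq. (4)

    alpha_best, _ = de_minimize(neg_log_likelihood, lb=0.0, ub=1.0, D=K)
    return stumps, alpha_best
```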

Experiments

Data

For this work, two classic MIL benchmark datasets are used, MUSK1 and MUSK2 [27], which are available in the UCI Machine Learning Repository [1]. They correspond to the problem of predicting drug activity: a molecule has the desired drug effect if and only if one or more of its conformations bind to the target binding site, and since a molecule can adopt multiple shapes, a bag is made up of the shapes (conformations) belonging to the same molecule. MUSK1 and MUSK2 contain 476 and 6598 instances, respectively. MUSK2 is used as the training data since it contains the greater number of instances, and MUSK1 is used as the testing data. Both datasets have 168 attributes in total, of which 166 are features. The attribute information is given in Table 1.

Table 1 Data description

Apart from the aforementioned datasets, Hastie_10_2, a standard binary classification dataset used to test boosting frameworks in [28], is also employed; it is available in the scikit-learn dataset library [29]. The Hastie_10_2 dataset has 10 attributes \({X}_{1},{X}_{2},\dots ,{X}_{10}\), which are standard independent Gaussian variates. The class is defined as

$$ y = \begin{cases} 1, & {\text{if}}\;\sum\limits_{i = 1}^{10} {X_{i}^{2} } > 9.34 \\ 0, & {\text{otherwise}} \end{cases} $$
(11)
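The dataset can be generated directly from scikit-learn; the sample size, seed and remapping of the returned {−1, +1} labels to {0, 1} below are illustrative choices.

```python
from sklearn.datasets import make_hastie_10_2

# Generate the Hastie_10_2 data; n_samples and the seed are illustrative choices.
X, y = make_hastie_10_2(n_samples=12000, random_state=1)
y = (y > 0).astype(int)   # scikit-learn returns labels in {-1, +1}; map them to {0, 1}
```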

Experimental setup

Decision tree classifiers with a maximum depth of 1 (decision stumps) are used as the weak classifiers, and log-sum-exp pooling is used for bag pooling. For implementation convenience, the negative of the log-likelihood function is taken and minimized, which is equivalent to maximizing the original function. The number of weak classifiers is varied, K ∈ {10,15,20,25,30,35,40}, to check its effect on training. The DE parameters are: crossover probability = 0.7, mutation strategy = best/1/bin, scaling factor ∈ [0.5, 1], number of generations = 1000 and population size = 20.
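For readers who prefer an off-the-shelf optimizer over a hand-rolled loop, the listed parameters map roughly onto SciPy's differential_evolution as sketched below. Note that SciPy's popsize argument is a multiplier of the problem dimensionality, so the fixed population of 20 is only approximated, and the objective name refers to the negative bag log-likelihood from the earlier sketch; this mapping is an assumption, not the authors' code.

```python
from scipy.optimize import differential_evolution

K = 10                                    # number of weak classifiers, one weight per stump
bounds = [(0.0, 1.0)] * K                 # assumed search range for the weights alpha_k

result = differential_evolution(
    neg_log_likelihood,                   # negative bag log-likelihood objective (assumed name)
    bounds,
    strategy="best1bin",                  # best/1/bin mutation strategy
    mutation=(0.5, 1.0),                  # scaling factor F drawn from [0.5, 1]
    recombination=0.7,                    # crossover probability CR
    maxiter=1000,                         # number of generations
    popsize=max(2, 20 // K),              # SciPy scales popsize by K; approximates 20 individuals
    workers=-1,                           # parallel fitness evaluation across CPU cores
    seed=1,
)
alpha_best = result.x
```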

Hardware and software specifications

Experiments have been conducted on Spyder 4.2.0 Integrated Development Environment (IDE) with Python 3.7.9 through Anaconda distribution on an Intel Xeon 2.5 GHz system with 16 GB RAM, Nvidia Quadro 2000 GPU and 64-bit Windows 10 Operating System.

Evaluation metrics

As this is a classification problem, the standard training error, testing error and confusion matrix on the testing data are used as evaluation metrics. In MIL, the model is evaluated on the basis of bag classification accuracy; therefore, bag training error and bag testing error are reported here.

Results and discussions

The results of the proposed EMILBoost are compared with two other boosting frameworks, GentleBoost and LogitBoost [30]. Tables 2 and 3 record the bag testing and training errors for the MUSK dataset. Bag testing and training error rate (convergence) curves for the different boosting frameworks are presented in Figs. 3 and 4 for the MUSK and Hastie_10_2 datasets, respectively. The training and testing error rate curves for different numbers of weak classifiers are presented in Figs. 5 and 6 for the MUSK and Hastie_10_2 datasets, respectively, while the confusion matrices for different numbers of weak classifiers are presented in Fig. 7 for the MUSK dataset.

Table 2 Bag test errors
Table 3 Bag train errors
Fig. 3

Error rate curves (MUSK)

Fig. 4

Error rate curves (Hastie_10_2)

Fig. 5

Error rates for strong classifier consisting of a 10 weak classifiers, b 15 weak classifiers, c 20 weak classifiers, d 25 weak classifiers, e 30 weak classifiers, f 35 weak classifiers, g 40 weak classifiers for MUSK dataset

Fig. 6

Error rates for strong classifier consisting of a 10 weak classifiers, b 15 weak classifiers, c 20 weak classifiers, d 25 weak classifiers, e 30 weak classifiers, f 35 weak classifiers, g 40 weak classifiers for Hastie_10_2 dataset

Fig. 7

Confusion Matrix for strong classifier consisting of a 10 weak classifiers, b 15 weak classifiers, c 20 weak classifiers, d 25 weak classifiers, e 30 weak classifiers, f 35 weak classifiers, g 40 weak classifiers for MUSK dataset

From Tables 2 and 3, it is clear that EMILBoost achieves the lowest errors and hence outperforms GentleBoost and LogitBoost. Figures 3 and 4 also confirm the superiority of EMILBoost.

From Figs. 5 and 6, it can be inferred that increasing the number of weak classifiers improves the learning process, i.e. it corresponds to lower error.

The upper left block of a confusion matrix signifies the True Positives (TP) and the lower right the True Negatives (TN), while the lower left signifies the False Positives (FP) and the upper right the False Negatives (FN). The main aim of a classifier is to obtain more TP + TN and fewer FP + FN. From Fig. 7, it can be inferred that for the EMILBoost framework \(TP+TN>FP+FN\); hence, the framework performs as desired.

Conclusion

The main aim of this paper was to enhance the MILBoost framework through DE, a population-based evolutionary metaheuristic, by optimizing the weak classifier weights. DE also paves the way for parallelizing this optimization process. The results show that the proposed EMILBoost outperforms GentleBoost and LogitBoost. Increasing the number of weak classifiers improves the learning process but, on the other hand, increases the learning time. A trade-off between the two is needed by optimizing the number of weak classifiers, which is a multi-objective problem and can be regarded as a future extension of this work.