1 Introduction

Supervised data classification is an important task in data mining. Given input-output examples, the aim of supervised data classification is to find a function that maps input data to output data. Such functions, also called classifiers, are computed using classification algorithms. Many classifiers exist based on statistical, decision tree, neural network and optimization approaches. Among these, piecewise linear classifiers have been shown to be effective. They require very little memory and testing time. Therefore, these classifiers have very specific applications where other classifiers cannot be applied. These applications include small reconnaissance robots, autonomous mobile robots, intelligent cameras, embedded and real-time systems, portable devices, industrial vision systems, automated visual surveillance systems and monitoring systems (Kostin 2006). In all these systems classifiers must learn without any human intervention.

Piecewise linear classifiers have been a subject of study for more than three decades. Finding a piecewise linear boundary between pattern classes is a difficult global optimization task. The objective function in this optimization problem is nonconvex and nonsmooth, and the training of piecewise linear classifiers may take a long time. Existing piecewise linear classifiers can be divided into two groups. The first group contains classifiers in which each segment of the piecewise linear boundary is trained locally (multiple optimization approach). The second group contains classifiers in which the problem of finding a piecewise linear boundary is formulated as a single optimization problem (single optimization approach). Piecewise linear classifiers based on the multiple optimization approach were developed in Gasimov and Ozturk (2006), Kostin (2006), Park and Sklansky (1989), Schulmeister and Wysotzki (1994), Sklansky and Michelotti (1980), Tenmoto et al. (1998). Classifiers based on the single optimization approach were developed in Astorino and Gaudioso (2002), Bagirov (2005) (see also Astorino et al. 2008; Carrizosa and Romero Morales 2013 for a review of such classifiers).

Recently, the incremental approach has become increasingly popular for designing classifiers. Note that there are two types of incremental algorithms in supervised data classification. In the first type, data points are added incrementally and the classifier is updated accordingly. In contrast, in the second type the data set is fixed and the decision boundaries between classes are built incrementally. Such incremental classifiers were introduced in Bagirov et al. (2011a, b, 2013).

In this paper, we design a new incremental piecewise linear classifier using polyhedral conic separability. Although polyhedral conic functions (PCFs) are more complex than linear functions, they can approximate nonlinear boundaries between classes using fewer functions. Furthermore, linear functions are a special case of polyhedral conic functions. Therefore, one can expect that the use of polyhedral conic functions will lead to a better approximation of nonlinear boundaries between classes.

In order to find piecewise linear boundaries between classes using PCFs, we introduce a classification error function which is nonsmooth and nonconvex. We also introduce an auxiliary function and design a special procedure to generate starting points by minimizing it. The proposed algorithm is based on the single optimization approach. We test the incremental algorithm using several publicly available large data sets and compare it with some mainstream classifiers.

This paper is organized as follows. In Sect. 2 some preliminaries about PCFs are provided. An auxiliary function is introduced and an algorithm for finding starting points is described in Sect. 3. Section 4 presents an incremental PCF algorithm and its implementation. Results of numerical experiments are reported in Sect. 5. Finally, Sect. 6 contains some concluding remarks.

2 Preliminaries: separation via polyhedral conic functions

In this section we briefly describe the notions of polyhedral conic functions and polyhedral conic separation. A more detailed description can be found in Gasimov and Ozturk (2006). The notion of conic separation is also discussed in Astorino et al. (2012), Kasimbeyli (2009), Kasimbeyli (2010), Kasimbeyli and Mammadov (2009).

Let \(A\) and \(B\) be given disjoint sets in \(R^n\) containing \(m\) and \(p\) points, respectively:

$$\begin{aligned} A&= \{a^1,\dots ,a^m\}, a^i \in R^n, i=1,\dots ,m,\\ B&= \{b^1,\dots ,b^p\}, b^j \in R^n, j=1,\dots ,p. \end{aligned}$$

Polyhedral conic functions (PCFs) have recently been proposed in Gasimov and Ozturk (2006) to construct a separation function for the sets \(A\) and \(B\).

Definition 1

(Gasimov and Ozturk 2006) A function \(g:R^n \rightarrow R\) is called polyhedral conic if its graph is a cone and all its level sets

$$\begin{aligned} S(\alpha ) = \left\{ x \in R^{n}:g(x)\le \alpha \ \right\} , \end{aligned}$$
(1)

for \(\alpha \in R\), are polyhedral sets.

Example 1

Consider the following function

$$\begin{aligned} g(x_1,x_2)= 0.11 (x_1-2) + 0.11(x_2-2)+0.33(|x_1-2|+|x_2-2|)-1. \end{aligned}$$

This function is polyhedral conic and its graph and level set are illustrated in Figs. 1 and 2, respectively.

Fig. 1 The graph of the polyhedral conic function \(g\)

Fig. 2 The level set of the polyhedral conic function \(g\)

Given \(w, c \in R^n\) and \(\xi , \gamma \in R\), a polyhedral conic function \(g_{(w,\xi ,\gamma ,c)}: R^{n}\rightarrow R\) is defined as follows:

$$\begin{aligned} g_{(w,\xi ,\gamma , c)}(x) = \langle w, x-c \rangle +\xi \left\| x-c \right\| _{1}-\gamma , \end{aligned}$$
(2)

where \(\Vert x\Vert _{1}=|x_1|+\cdots +|x_n|\) is the \(l_1\)-norm of the vector \(x \in R^n\) and \(\langle \cdot , \cdot \rangle \) is the inner product in \(R^n\).
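To make the definition concrete, the following sketch evaluates a PCF of the form (2); the function and variable names are ours, not part of the paper. With \(w=(0.11,0.11)\), \(\xi =0.33\), \(\gamma =1\) and \(c=(2,2)\) it reproduces the function of Example 1.

```python
import numpy as np

def pcf_value(x, w, xi, gamma, c):
    """g_{(w,xi,gamma,c)}(x) = <w, x - c> + xi * ||x - c||_1 - gamma, cf. (2)."""
    d = np.asarray(x, dtype=float) - np.asarray(c, dtype=float)
    return float(np.dot(w, d) + xi * np.sum(np.abs(d)) - gamma)

# Example 1 corresponds to w = (0.11, 0.11), xi = 0.33, gamma = 1, c = (2, 2);
# the vertex of the cone is at (c, -gamma), cf. Lemma 1 below.
print(pcf_value([2.0, 2.0], w=[0.11, 0.11], xi=0.33, gamma=1.0, c=[2.0, 2.0]))  # -1.0
```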

Lemma 1

(Gasimov and Ozturk 2006) A graph of the function \(g_{(w,\xi ,\gamma ,c)}\) defined in (2) is a polyhedral cone with a vertex at \((c,-\gamma ) \in R^n \times R\).

The sets \(A\) and \(B\) are polyhedral conic separable if there exists a finite number of PCFs \(g_l=g_{(w^l,\xi ^l ,\gamma ^l,c^l)}, ~l=1,\ldots ,L\) such that

$$\begin{aligned} \min _{l=1,\ldots ,L} g_l(a) \le 0 ~~\forall a \in A \end{aligned}$$

and

$$\begin{aligned} \min _{l=1,\ldots ,L} g_l(b) > 0 ~~\forall b \in B. \end{aligned}$$

Figure 3 shows how a PCF can separate the two sets \(A\) and \(B\) in \(R^2\). In this figure the set \(A\) is shown by red squares and the set \(B\) by blue balls. These two sets are clearly not linearly separable; however, the constructed polyhedral conic function separates them completely.

Fig. 3 Separation using a polyhedral conic function

An algorithm generating a polyhedral conic separating function, called the PCF algorithm, was developed in Gasimov and Ozturk (2006).

An error function for polyhedral conic separation of \(A\) and \(B\) can be formulated as follows (Bagirov et al. 2013):

$$\begin{aligned}&\varPhi (w^1,c^1,\xi ^1,\gamma ^1,\ldots ,w^L,c^L,\xi ^L,\gamma ^L) =\nonumber \\&\quad \frac{1}{m} \sum _{a \in A} \max \left\{ 0, \min _{l=1,\ldots ,L} g_l(a) \right\} + \frac{1}{p} \sum _{b \in B} \max \left\{ 0, -\min _{l=1,\ldots ,L} g_l(b) \right\} . \end{aligned}$$
(3)
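Read literally, (3) averages a hinge-type penalty over each set: a point \(a \in A\) is penalized when it lies on the positive side of all PCFs, and a point \(b \in B\) when it lies on the non-positive side of some PCF. A minimal sketch, reusing the pcf_value helper above (the function names are ours):

```python
def separating_value(x, pcfs):
    """G(x) = min_l g_l(x); pcfs is a list of quadruples (w, xi, gamma, c)."""
    return min(pcf_value(x, w, xi, gamma, c) for (w, xi, gamma, c) in pcfs)

def classification_error(A, B, pcfs):
    """Error function (3) for the current collection of PCFs."""
    err_A = sum(max(0.0, separating_value(a, pcfs)) for a in A) / len(A)
    err_B = sum(max(0.0, -separating_value(b, pcfs)) for b in B) / len(B)
    return err_A + err_B
```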

Then the problem of finding polyhedral conic functions separating sets \(A\) and \(B\) is reduced to the following mathematical programming problem:

$$\begin{aligned} \mathrm{minimize} ~\varPhi (w^1,c^1,\xi ^1,\gamma ^1,\ldots ,w^L,c^L,\xi ^L,\gamma ^L) \end{aligned}$$
(4)

subject to

$$\begin{aligned} w^i, c^i \in R^n,~\xi ^i, \gamma ^i \in R,~i=1,\ldots ,L. \end{aligned}$$
(5)

The objective function in Problem given by (4)–(5) is nonsmooth and nonconvex for any \(L \ge 1\). It may have many local minimizers, and their number increases as the number of polyhedral conic functions \(L\) increases. However, global minimizers of Problem given by (4)–(5) are of interest since they provide the least number of PCFs separating the sets \(A\) and \(B\) with maximum accuracy. The number of variables in this problem is \(2(n+1)L\) and becomes large as the number \(L\) of PCFs increases. Such problems are out of reach for many existing global optimization techniques, and finding global minimizers of Problem given by (4)–(5) can be very time consuming.

Therefore, the aim of this paper is to design an algorithm which is able to find either global or near-global minimizers of Problem given by (4)–(5). This algorithm involves a special procedure for generating “promising” starting points, which is crucial when a local search method is applied to minimize the function \(\varPhi \). This procedure is given in the next section.

3 Computation of starting points

In this section we propose an algorithm for finding starting points for solving Problem given by (4)–(5). First, we introduce an auxiliary function.

Assume that the solution \(g_1,\ldots ,g_k\) to Problem given by (4)–(5) for \(L=k \ge 1\) is known. This means that we have \(k\) quadruples \((w^l,c^l,\xi ^l,\gamma ^l), l=1,\ldots , k\). Then the function separating sets \(A\) and \(B\) can be expressed as follows:

$$\begin{aligned} G_k(x)=\min _{l=1,\ldots ,k} g_l(x). \end{aligned}$$
(6)

The value of the global minimum of the problem given by (4)–(5) is:

$$\begin{aligned} \varPhi _{k,min}=\frac{1}{|A|}\sum _{a \in A} \max \{0, G_k(a)\}+\frac{1}{|B|}\sum _{b \in B} \max \{0, -G_k(b)\}, \end{aligned}$$

where \(|C|\) is the cardinality of a set \(C\). Using the function \(G_k\) we can divide the class \(A\) into two subsets: the subset of correctly classified points \(A^{cc}\) and the subset of misclassified points \(A^{mc}\). These sets can be described as

$$\begin{aligned} A^{mc}=\{a \in A: G_k(a)>0 \} \end{aligned}$$
(7)

and

$$\begin{aligned} A^{cc} = A \setminus A^{mc}. \end{aligned}$$
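In code, this split of \(A\) is a simple filter on the sign of \(G_k\); a sketch with the helpers introduced above:

```python
def split_class_A(A, pcfs):
    """Return (A_mc, A_cc) according to (7): a is misclassified when G_k(a) > 0."""
    A_mc = [a for a in A if separating_value(a, pcfs) > 0.0]
    A_cc = [a for a in A if separating_value(a, pcfs) <= 0.0]
    return A_mc, A_cc
```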

Let \(\bar{g}_{(w,c,\xi ,\gamma )}\) be a polyhedral conic function depending on a set of parameters \((w,c,\xi ,\gamma ), w, c \in R^n,~\xi , \gamma \in R\). Define the function

$$\begin{aligned} \varphi _k(w,c,\xi ,\gamma )&= \frac{1}{|A|}\sum _{a \in A} \max \left\{ 0, \min \{G_k(a), \bar{g}_{(w,c,\xi ,\gamma )}(a)\} \right\} \nonumber \\&+ \frac{1}{|B|}\sum _{b \in B} \max \left\{ 0, -\min \{G_k(b), \bar{g}_{(w,c,\xi ,\gamma )}(b) \}\right\} \end{aligned}$$
(8)

which is called the \(k\)-th auxiliary function. This function is nonsmooth and nonconvex. The following problem is called the \(k\)-th auxiliary problem:

$$\begin{aligned} \mathrm{minimize} ~\varphi _k(w,c,\xi ,\gamma ) \qquad \text {subject to} \qquad w, c \in R^n, \xi , \gamma \in R. \end{aligned}$$
(9)
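The auxiliary function (8) differs from the error function (3) only in that the trial PCF \(\bar{g}_{(w,c,\xi ,\gamma )}\) is combined with the fixed function \(G_k\) through a pointwise minimum. A sketch under the same assumptions as above:

```python
def auxiliary_value(A, B, fixed_pcfs, w, c, xi, gamma):
    """k-th auxiliary function (8): fixed_pcfs defines G_k, one trial PCF is varied."""
    def combined(x):
        return min(separating_value(x, fixed_pcfs), pcf_value(x, w, xi, gamma, c))
    err_A = sum(max(0.0, combined(a)) for a in A) / len(A)
    err_B = sum(max(0.0, -combined(b)) for b in B) / len(B)
    return err_A + err_B
```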

Consider the following set

$$\begin{aligned} C_k=\left\{ (w,c,\xi ,\gamma ): ~\bar{g}_{(w,c,\xi ,\gamma )}(a) \ge G_k(a), \quad \forall a \in A \right\} . \end{aligned}$$
(10)

It is obvious that on this set the function \(\varphi _k\) is constant. Moreover,

$$\begin{aligned} \varphi _k(w,c,\xi ,\gamma ) = \varPhi _{k,min} = \max \left\{ \varphi _k(\bar{w},\bar{c},\bar{\xi },\bar{\gamma }): \bar{w},\bar{c} \in R^n, \bar{\xi }, \bar{\gamma } \in R \right\} , \forall (w,c,\xi ,\gamma ) \in C_k \end{aligned}$$

and the function \(\varphi _k\) does not depend on the variables \(w,c,\xi ,\gamma \) in \(C_k\). Therefore, its subdifferential at any point from this set contains the origin, meaning that all these points are stationary points of the function \(\varphi _k\). Any local method may terminate if it starts from one of these points, and therefore such points cannot be considered good starting points for solving Problem (9). Consequently, we choose starting points from the complement \(\bar{C}_k\) of the set \(C_k\), defined as:

$$\begin{aligned} \bar{C}_k=\left( R^n \times R^n \times R \times R \right) \setminus C_k = \left\{ (w,c,\xi ,\gamma ): \exists a \in A ~\text {such that} ~\bar{g}_{(w,c,\xi ,\gamma )}(a) < G_k(a)\right\} . \end{aligned}$$
(11)

It is clear that the value of the function \(\varphi _k\) at any point from \(\bar{C}_k\) is strictly less than its maximum value \(\varPhi _{k,min}\), and any such point can be used as a starting point for minimizing this function. However, not all of these points guarantee a decrease in the number of misclassified points and a sufficient decrease of the value of \(\varphi _k\) compared with \(\varPhi _{k,min}\). In order to achieve such a decrease, we propose to use misclassified points from \(A\) as the center \(c\) in the function \(\bar{g}\). In this case the new function will correctly classify at least one misclassified point, namely the center. An algorithm generating starting points proceeds as follows.

Algorithm 1 Computation of starting points for solving Problem (9).

Input The data set with two classes \(A\) and \(B\), \(k\) PCFs with the parameters \((w^l,c^l,\xi ^l,\gamma ^l), l=1,\ldots ,k\), a number \(u \in [0,1]\) and a sufficiently small number \(\varepsilon > 0\).

Output The set \(P\) of starting points for solving Problem (9).

Step 1 (Computing a new PCF) Take each misclassified point \(a \in A^{mc}\) as a center and consider \(k\) new PCFs with the parameters \((w^l,a,\xi ^l,\gamma ^l), l=1,\ldots , k\). Find the value \(\bar{\gamma }^l\) of \(\gamma ^l\) satisfying the following inequalities

$$\begin{aligned} \bar{g}_{(w^l,a,\xi ^l,\bar{\gamma }^l)}(b) > 0, \quad \forall b \in B. \end{aligned}$$

To find such a \(\bar{\gamma }^l\) compute

$$\begin{aligned} \eta ^l = \min _{b \in B} \bar{g}_{(w^l,a,\xi ^l,\gamma ^l)}(b) \end{aligned}$$

and define \(\bar{\gamma }^l\) as follows:

$$\begin{aligned} \bar{\gamma }^l = \gamma ^l + \eta ^l - \varepsilon . \end{aligned}$$

Step 2 (Computation of the maximum decrease for each misclassified point). For each \(a \in A^{mc}\) calculate the decrease in the value of the error function \(\varPhi \) when the new PCF \(\bar{g}_{(w^l,a,\xi ^l,\bar{\gamma }^l)}\) is added:

$$\begin{aligned} D_l(a) = \sum _{x \in A^{mc}} \left[ \max \left\{ 0, G_k(x)-\max \{\bar{g}_{(w^l,a,\xi ^l,\bar{\gamma }^l)}(x),0\}\right\} \right] . \end{aligned}$$

Compute

$$\begin{aligned} D_{max}(a) = \max _{l=1,\ldots ,k} D_l(a) \end{aligned}$$

and

$$\begin{aligned} (\bar{w}, a,\bar{\xi }, \bar{\gamma }) = \underset{l=1,\ldots ,k}{\text {argmax~}} D_l(a). \end{aligned}$$

Step 3 (Finding candidates for starting points) Compute the maximum decrease among all misclassified points from the set \(A^{mc}\):

$$\begin{aligned} \bar{D}_{max} = \max _{a \in A^{mc}} D_{max}(a). \end{aligned}$$

Define the set of best candidate centers and parameters as follows:

$$\begin{aligned} \bar{A}^{mc}=\left\{ a\in A^{mc} : D_{max}(a) \ge u \bar{D}_{max} \right\} ,\\ P=\left\{ (\bar{w},\bar{a},\bar{\xi }, \bar{\gamma }): \bar{a} \in \bar{A}^{mc} \right\} . \end{aligned}$$

Points from the set \(P\) are starting points for solving Problem (9).

In Step 1, we take each misclassified point \(a\in A^{mc}\) as a center and compute \(k\) new PCFs using the solution to Problem given by (4)–(5) for \(L=k\). We update the parameter \(\gamma \) to ensure that the level sets \(S_l(0), l=1,\ldots ,k\) of the new PCFs, computed by (1), do not contain any point from the set \(B\). If \(\eta < 0\), then these sets contain points from \(B\). This case is illustrated in Fig. 4. Here the set \(A\) is described by blue squares and the set \(B\) by red balls. In this case we decrease \(\gamma \) by adding \(\eta <0\) and make the sets \(S_l(0)\) smaller, so that all points from \(B\) lie outside of \(S_l(0), l=1,\ldots ,k\). It is obvious that in this case \(|\eta | < \gamma \), due to the fact that \(a\in S_l(0), l=1,\ldots ,k\) (see Gasimov and Ozturk 2006 for details).

Fig. 4 Computing \(\eta \) in Step 1 of Algorithm 1: Case \(\eta < 0\)

If \(\eta >0\), the sets \(S_l(0), l=1,\ldots ,k\) do not contain any point from \(B\). This case is illustrated in Fig. 5. In this case we increase \(\gamma \) by adding \(\eta > 0\) and make the sets \(S_l(0)\) larger, while all points from \(B\) still lie outside of \(S_l(0), l=1,\ldots ,k\). This may decrease the number of misclassified points in the set \(A^{mc}\). Figure 6 illustrates the update of the parameter \(\gamma \).

Fig. 5 Computing \(\eta \) in Step 1 of Algorithm 1: Case \(\eta > 0\)

Fig. 6 Updating \(\gamma \) in Step 1 of Algorithm 1

In Step 2, for each \(a\in A^{mc}\) we choose, among the \(k\) PCFs, the one providing the largest decrease of the error function. In Step 3, the set of best candidates for starting points is chosen. If \(u=0\) then \(\bar{A}^{mc}=A^{mc}\), that is, the PCFs computed for all misclassified points are selected. If \(u=1\) then only the PCFs providing the largest decrease of the error function are selected.
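The whole of Algorithm 1 can be summarized in a short routine. The sketch below follows Steps 1–3 with our own helper names; the small margin eps only ensures that all points of \(B\) stay strictly outside the shifted level sets \(S_l(0)\).

```python
def starting_points(A, B, pcfs, u=1.0, eps=1e-6):
    """Algorithm 1 (sketch): starting points for the auxiliary problem (9).

    pcfs holds the k quadruples (w, xi, gamma, c) found so far; u is in [0, 1]."""
    A_mc, _ = split_class_A(A, pcfs)
    if not A_mc:
        return []
    candidates = []                                  # pairs (D_max(a), best quadruple)
    for a in A_mc:
        best = None
        for (w, xi, gamma, c) in pcfs:
            # Step 1: shift gamma so that every b in B lies strictly outside S(0).
            eta = min(pcf_value(b, w, xi, gamma, a) for b in B)
            gamma_bar = gamma + eta - eps
            # Step 2: decrease of the error over the misclassified points.
            D = sum(max(0.0, separating_value(x, pcfs)
                             - max(pcf_value(x, w, xi, gamma_bar, a), 0.0))
                    for x in A_mc)
            if best is None or D > best[0]:
                best = (D, (w, xi, gamma_bar, a))
        candidates.append(best)
    # Step 3: keep candidates whose decrease is within the factor u of the largest one.
    D_best = max(D for D, _ in candidates)
    return [quad for D, quad in candidates if D >= u * D_best]
```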

4 Incremental algorithm

In this section we design an incremental algorithm for solving Problem given by (4)–(5).

4.1 The description of the algorithm

Algorithm 2 Incremental algorithm for solving Problem given by (4)–(5).

Input The data set with two classes \(A\) and \(B\).

Output The set of PCFs separating the sets \(A\) and \(B\).

Step 1 (Initialization) Select the numbers \(u \in [0,1], v \in [1,\infty )\), the maximum number of PCFs \(K_{max} >0\) and tolerances \(\varepsilon _1, \varepsilon _2 >0\).

Step 2 (Computation of the first PCF) Compute the centroid of the set \(A\) and a point \(a^1 \in A\) closest to this centroid. Select the point \(c^1=a^1\) as the center of the first PCF and solve Problem given by (4)–(5) for \(L=1\) to find the first PCF \(g_1=g_{(w^1,c^1,\xi ^1,\gamma ^1)}\). Define \(G_1\) by (6) and compute the set \(A^{1,mc}=A^{mc}\) using (7). Set \(k:=1\).

Step 3 (Stopping criteria) If one of the following conditions

1. \(k>K_{max}\);

2. \(A^{k,mc}=\emptyset \);

3. \(\varPhi (w^1,c^1,\xi ^1,\gamma ^1,\ldots ,w^k,c^k,\xi ^k,\gamma ^k) \le \varepsilon _1\);

4. \(k \ge 2\) and

$$\begin{aligned} \frac{\varPhi (w^1,c^1,\xi ^1,\gamma ^1,\ldots ,w^{k-1},c^{k-1},\xi ^{k-1},\gamma ^{k-1})- \varPhi (w^1,c^1,\xi ^1,\gamma ^1,\ldots ,w^k,c^k,\xi ^k,\gamma ^k)}{\varPhi (w^1,c^1,\xi ^1,\gamma ^1)} \le \varepsilon _2 \end{aligned}$$

is satisfied, then the algorithm terminates. Otherwise, set \(k:=k+1\) and go to Step 4.

Step 4 (Finding the starting points for the auxiliary problem (9)).

Apply Algorithm 1 using the set \(A^{mc}=A^{k-1,mc}\). This algorithm generates a set of starting points \(P\).

Step 5 (Solving auxiliary problem (9)).

Take each \((w,a,\xi ,\gamma ) \in P\) as a starting point, solve the auxiliary problem (9) and find a local minimizer \((\tilde{w},\tilde{a},\tilde{\xi },\tilde{\gamma })\). Denote by \(\tilde{P}\) the set of all such local minimizers.

Step 6 (Finding a set of starting points for the \(k\)-th PCF in Problem given by (4)–(5)).

Compute

$$\begin{aligned} \bar{\varphi }_k= \min \left\{ \varphi _k(w,c,\xi ,\gamma ): (w,c,\xi ,\gamma ) \in \tilde{P}\right\} \end{aligned}$$

and define the set of starting points as follows:

$$\begin{aligned} \hat{P} = \left\{ (w,c,\xi ,\gamma ) \in \tilde{P}: \varphi _k(w,c,\xi ,\gamma ) \le v \bar{\varphi }_k \right\} . \end{aligned}$$

Step 7 (Solving Problem given by (4)–(5)).

For each \((w,a,\xi ,\gamma ) \in \hat{P}\) set \((w^k,c^k,\xi ^k,\gamma ^k):=(w,a,\xi ,\gamma )\), take \((w^l,c^l,\xi ^l,\gamma ^l), l=1,\ldots ,k\) as a starting point and solve Problem given by (4)–(5). As a result we obtain \(|\hat{P}|\) solutions to this problem. Select the solution with the least value of the error function as the solution to Problem given by (4)–(5). Go to Step 3.
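The incremental loop of Algorithm 2 can be sketched as follows. The routine is an illustration only: local_solver(objective, x0) stands in for the nonsmooth local method (the discrete gradient method in the paper) and is assumed to return a locally optimal point in the same format as its starting point; all other helper names are ours.

```python
def ipcs_train(A, B, local_solver, K_max=5, eps1=1e-3, eps2=1e-2, u=1.0, v=1.0):
    """Algorithm 2 (sketch): incrementally build a list of PCF quadruples (w, xi, gamma, c)."""
    # Step 2: first PCF, centered at the data point of A closest to the centroid of A.
    centroid = np.mean(np.asarray(A, dtype=float), axis=0)
    c1 = min(A, key=lambda a: float(np.linalg.norm(np.asarray(a) - centroid)))
    w1, xi1, g1 = local_solver(
        lambda q: classification_error(A, B, [(q[0], q[1], q[2], c1)]),
        (np.zeros(len(c1)), 1.0, 1.0))
    pcfs = [(w1, xi1, g1, c1)]
    errors = [classification_error(A, B, pcfs)]
    for k in range(2, K_max + 1):                        # Step 3: stopping criteria 1-4
        A_mc, _ = split_class_A(A, pcfs)
        if not A_mc or errors[-1] <= eps1:
            break
        if len(errors) >= 2 and (errors[-2] - errors[-1]) / errors[0] <= eps2:
            break
        # Steps 4-5: starting points, then local minimization of the auxiliary problem (9).
        P = starting_points(A, B, pcfs, u=u)
        aux = lambda q: auxiliary_value(A, B, pcfs, q[0], q[3], q[1], q[2])
        P_tilde = [local_solver(aux, q) for q in P]
        # Step 6: keep minimizers whose auxiliary value is within the factor v of the best.
        best = min(aux(q) for q in P_tilde)
        P_hat = [q for q in P_tilde if aux(q) <= v * best]
        # Step 7: re-optimize all k PCFs from each candidate and keep the best solution.
        full = lambda ps: classification_error(A, B, ps)
        solutions = [local_solver(full, pcfs + [q]) for q in P_hat]
        pcfs = min(solutions, key=full)
        errors.append(full(pcfs))
    return pcfs
```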

4.2 Discussion on Algorithm 2 and its implementation

Algorithm 2 computes a piecewise linear boundary between the sets \(A\) and \(B\) using polyhedral conic functions. It is an incremental algorithm: it starts with the computation of one PCF (Step 2) and adds one PCF at each iteration until separation is obtained with respect to some tolerance. The algorithm involves solving two optimization problems at each iteration: the auxiliary problem (9) (Step 5) and the problem given by (4)–(5) (Step 7). Both problems are nonsmooth and nonconvex. Therefore, the choice of starting points is crucial when a local method is applied to solve them.

Starting points for solving the auxiliary problem (9) are computed by applying Algorithm 1 (Step 4). This algorithm uses misclassified points from the set \(A\) as centers for polyhedral conic functions. Moreover, it uses a threshold to determine the most “promising” starting points. Starting points for solving the problem given by (4)–(5) are found by solving the auxiliary problem (9) (Steps 5 and 6). In Step 5 several local minimizers of the auxiliary problem are computed, and in Step 6 the best auxiliary function value among all obtained local minimizers is determined. Then, using a threshold, the local minimizers which provide a sufficient decrease of the auxiliary function are chosen as starting points for the \(k\)-th PCF. These starting points are used in Step 7 to solve the problem given by (4)–(5).

Algorithm 2 has four stopping criteria given in Step 3. The first stopping criterion is used to restrict the number of PCFs separating two sets. This allows one to avoid possible overfitting. The second stopping criterion in Step 3 means that with the given number of PCFs a perfect separation of the set \(A\) from the set \(B\) is achieved. The third stopping criterion means that a satisfactory separation is achieved with respect to a given tolerance. Finally, the fourth stopping criterion implies that adding more PCFs will not improve the separation quality and may lead to overfitting.

In order to implement Algorithm 2, one has to select the maximum number \(K_{max}\) of PCFs for separating the sets \(A\) and \(B\). The value of this number depends on the size of a data set. For small data sets this number should be small (between 2 and 4), whereas for large data sets it can be larger (between 5 and 10). Such a choice helps to prevent possible overfitting. In Step 2 we compute the centroid of the set \(A\) and a point from this set closest to the centroid. This point is selected as the center for the first PCF. In some data sets a large neighborhood of the centroid may not contain any point from the data set. In such situations, if one chooses the centroid itself as the center of a PCF, this PCF may not separate even one point from the set \(A\). Therefore, it is preferable to choose a data point as the center. If the center is fixed then, for one PCF, the problem given by (4)–(5) is a linear programming problem and any linear programming solver can be applied to solve it.
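For completeness, a sketch of this linear program is given below, using scipy.optimize.linprog; the variable ordering and slack names are ours, not the paper's implementation. With the center c fixed, g is linear in \((w,\xi ,\gamma )\), and the two max\(\{0,\cdot \}\) terms of (3) are modeled by nonnegative slack variables.

```python
import numpy as np
from scipy.optimize import linprog

def first_pcf_lp(A, B, center):
    """One PCF with fixed center c: minimize (1/m) sum s + (1/p) sum t subject to
    s_i >= g(a_i), t_j >= -g(b_j), s >= 0, t >= 0, variables ordered as (w, xi, gamma, s, t)."""
    A = np.asarray(A, float); B = np.asarray(B, float); c = np.asarray(center, float)
    m, n = A.shape
    p = B.shape[0]
    dA, dB = A - c, B - c
    cost = np.concatenate([np.zeros(n + 2), np.full(m, 1.0 / m), np.full(p, 1.0 / p)])
    # g(a_i) - s_i <= 0   and   -g(b_j) - t_j <= 0
    rows_A = np.hstack([dA, np.abs(dA).sum(axis=1, keepdims=True),
                        -np.ones((m, 1)), -np.eye(m), np.zeros((m, p))])
    rows_B = np.hstack([-dB, -np.abs(dB).sum(axis=1, keepdims=True),
                        np.ones((p, 1)), np.zeros((p, m)), -np.eye(p)])
    res = linprog(cost,
                  A_ub=np.vstack([rows_A, rows_B]), b_ub=np.zeros(m + p),
                  bounds=[(None, None)] * (n + 2) + [(0, None)] * (m + p),
                  method="highs")
    w, xi, gamma = res.x[:n], res.x[n], res.x[n + 1]
    return w, xi, gamma
```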

Algorithm 2 has four stopping criteria given in Step 3. The first one restricts the number of PCFs. In the implementation of the algorithm we choose \(K_{max}=5\). The second criterion is applied when there is no misclassified point in the set \(A\), which means that the set \(A\) is perfectly separated from the set \(B\) with the given number of PCFs. The third stopping criterion is satisfied when the sets \(A\) and \(B\) can be separated with the given number of PCFs within the tolerance \(\varepsilon _1\). We choose \(\varepsilon _1 = 0.001\) to allow the algorithm to reach the separation with high accuracy. The fourth criterion is applied when adding a new PCF does not lead to any significant improvement in the separation of the two sets. We choose \(\varepsilon _2 = 0.01\) to require a significant improvement in the separation of the sets at each iteration of the algorithm.

In Step 4, Algorithm 1 is applied to find starting points for solving the auxiliary problem (9). This algorithm requires the parameter \(u \in [0,1]\). Small values of \(u\) lead to a large number of starting points for solving the auxiliary problem. The use of a large number of starting points makes the solution of the auxiliary problem time consuming; on the other hand, it allows one to find either a global or a near-global solution of the auxiliary problem. Values of \(u\) close to 1 reduce the number of starting points and select only those with a decrease of the error function close to the largest decrease. In all data sets we choose \(u=1\). In Step 5 we find the set of local minimizers of the auxiliary problem (9) starting from the points found in Step 4.

In Step 6, the least value of the auxiliary function is computed among all local minimizers found in Step 5. Then, using this value, the set of starting points for the next PCF to be added to the separating function is computed. In order to compute this set, the value of the parameter \(v\) should be given. The number of starting points depends on this value. If \(v=1\) then only the local minimizers with the least value of the auxiliary function are chosen. If the value of \(v\) is sufficiently large then all local minimizers of the auxiliary problem found in Step 5 are used as starting points for the next PCF. In our numerical experiments we choose \(v=1\), that is, we use only the best local minimizers as starting points.

The solution to the problem given by (4)–(5) is found in Step 7. In order to compose starting points for solving this problem, each local minimizer of the auxiliary problem chosen in Step 6 is added to the \(k-1\) PCFs found at the previous iteration of the incremental algorithm. Thus, the number of starting points equals the number of local minimizers of the auxiliary problem. The problem given by (4)–(5) is solved starting from each of these points, and the solution with the least value of the error function is accepted as the solution to Problem given by (4)–(5). Such an approach allows one to find either a global or a near-global solution of this problem.

The objective functions in problems given by (4)–(5) and (9) are nonsmooth and nonconvex. The computation of subgradients of such functions is not an easy task. We propose to apply the discrete gradient method to solve these problems. Details of this method can be found in Bagirov (2003) and Bagirov et al. (2008). The discrete gradient method does not require the exact calculation of subgradients, uses only values of a function to approximate subgradients, and is applicable to nonsmooth nonconvex optimization problems. The objective functions in both problems are piecewise partially separable. Therefore, we apply the version of the discrete gradient method for such problems introduced in Bagirov and Ugon (2006).

In order to solve multi-class data classification problems we use the one-vs-all strategy. This means that for a given data set \(A\) with \(q \ge 2\) classes \(A_1,\ldots ,A_q\) we take any class \(A_j, j \in \{1,\ldots ,q\}\) as the set \(A\) and define the set \(B\) as the union of all remaining classes. Algorithm 2 generates the necessary number of PCFs for each class \(A_j\), and this number can differ between classes. The final separating function for each class is defined as the pointwise minimum of its PCFs. The classification rule is defined as follows: for every new data point (observation) the values of the final separating functions of all classes are calculated, and the point is assigned to the class whose separating function has the minimum value. We implement the proposed algorithm in Fortran 77 and compile it using the gfortran compiler.
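The resulting decision rule is a one-liner. The sketch below reuses separating_value from Sect. 2 and assumes (our own naming) that class_pcfs maps each class label to the list of PCF quadruples produced for that class by Algorithm 2.

```python
def predict(x, class_pcfs):
    """Assign x to the class whose final separating function attains the minimum value."""
    return min(class_pcfs, key=lambda label: separating_value(x, class_pcfs[label]))
```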

5 Numerical results

We tested the proposed algorithm, the Incremental Polyhedral Conic Separation (IPCS) algorithm, on medium-sized and large-scale real-world data sets available from the UCI machine learning repository (Bache and Lichman 2013). The selected data sets contain either continuous or integer attributes and have no missing values. Table 1 presents a brief description of the data sets: the number of data points in the training and test sets, and the number of attributes and classes in each data set.

Table 1 Brief description of data sets

In our experiments we used some classifiers from WEKA (Waikato Environment for Knowledge Analysis, Version 3.7.10) for comparison. WEKA is a popular machine learning suite for data mining tasks written in Java and developed at the University of Waikato, New Zealand (see Hall et al. 2009 for details). We chose representatives of different types of classifiers from WEKA: Naive Bayes with kernel (NB kernel), Logistic, Multi-Layer Perceptron (MLP), the support vector machine classifiers linear LibSVM (LibSVM(LIN)), LibSVM with polynomial kernel (LibSVM(POL)), LibSVM with RBF kernel (LibSVM(RBF)), SMO with normalized polynomial kernel (SMO(NPOL)) and SMO(PUK), the decision tree classifier J48 (an implementation of the C4.5 algorithm) and the rule-based classifier PART. We applied all classifiers from WEKA with the default parameter values, except the LibSVM(POL) and LibSVM(RBF) classifiers, for which a special procedure was applied to estimate optimal values of their parameters. We also included two piecewise linear classifiers in our experiments: the incremental max-min separability algorithm (CIMMS) from Bagirov et al. (2011b) and the Hybrid Polyhedral Conic and Max–min Separability (HPCAMS) classifier from Bagirov et al. (2013).

Numerical experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470S CPU at 2.90 GHz and 8 GB of RAM running Windows 7. In the tables a dash indicates that an algorithm requires more memory than available.

Results for test set accuracy on different data sets using different classifiers are given in Tables 2 and 3. The proposed classifier achieves the best test accuracy in two data sets: Abalone and Spambase. In two data sets, Shuttle Control and Texture_CR, this classifier's test accuracy is close to the best accuracy achieved by all classifiers used. In all other data sets its accuracy is reasonably high in comparison with other classifiers. This means that the proposed classifier produces stable results across different data sets.

Table 2 Test set accuracy for different classifiers
Table 3 Test set accuracy for different classifiers (cont.)

We say that two classifiers produce similar test set accuracy if the difference between their accuracies is within \(0.5~\%\). Table 4 presents a pairwise comparison of the IPCS classifier with the others using test set accuracy. These results clearly demonstrate that the proposed classifier produces better or similar test set accuracy in most data sets in pairwise comparison with all other classifiers, except the LibSVM(RBF) classifier. Furthermore, the IPCS classifier generates better test set accuracy than the others, except the LibSVM(RBF) and SMO(PUK) classifiers, in at least half of the data sets.

Table 4 Pairwise comparison of the IPCS classifier with others using test set accuracy

Tables 5, 6 and 7 present the total training (\(t_{train}\)) and testing (\(t_{test}\)) time used by the classifiers. Note that these results were obtained using different platforms (WEKA and Fortran 77); nevertheless, they give some estimate of the training and testing time required by the classifiers. One can see that, with a few exceptions, all three piecewise linear classifiers require (sometimes significantly) more training time than the other classifiers. However, the testing time required by the piecewise linear classifiers is significantly less than that of the support vector machine classifiers and also the Naive Bayes classifier. The Logistic and MLP classifiers, in general, require more testing time than the piecewise linear classifiers. The testing time used by the rule-based classifier PART and the decision tree classifier J48 is similar to that used by the piecewise linear classifiers.

Table 5 Comparison of classifiers using training and testing time
Table 6 Comparison of classifiers using training and testing time (cont)
Table 7 Comparison of classifiers using training and testing time (cont)

Table 8 contains the total and average (per class) number of PCFs used by the IPCS algorithm to separate the classes. One can see that this classifier uses on average only a few polyhedral conic functions to separate each class from the rest of a data set. Therefore, in the testing phase the memory usage of this classifier is very low.

Based on the results presented in Tables 5, 6, 7 and 8 we can conclude that, although the proposed classifier uses significantly more training time than most classifiers, it requires very little testing time and its memory usage is very low. Since in many applications the training of classifiers can be done off-line, the proposed classifier can be considered an accurate and efficient real-time classifier. This makes it applicable in many of the real-world applications listed in the introduction of this paper.

Table 8 The total and average number of PCFs used by the IPCS classifier

6 Conclusions

In this paper, a piecewise linear classifier is designed to solve supervised data classification problems. This classifier uses several polyhedral conic functions to separate classes and is based on the incremental approach. More specifically, it starts with one polyhedral conic function and adds a new polyhedral conic function at each iteration until a satisfactory separation of classes is achieved. Polyhedral conic functions are computed by minimizing an error function which is nonsmooth and nonconvex. The calculation of subgradients of the error function is not an easy task; we apply the discrete gradient method to minimize it. This method does not require the exact calculation of subgradients.

In order to find good starting points for the minimization of the error function, we formulate an auxiliary problem using the polyhedral conic functions computed in previous iterations of the incremental algorithm. Such an approach allows one to find either global or near-global minimizers of the error function and to compute as few polyhedral conic functions as necessary to separate the classes. The proposed classifier is tested using 12 data sets. Results of numerical experiments demonstrate its high efficiency. Comparison with some mainstream classifiers shows that the proposed classifier is able to produce better or similar test set accuracy on the data sets used in the numerical experiments. The training time of this classifier is longer than that of many other classifiers; however, its testing is almost instantaneous for the whole test set. The proposed classifier needs only a few polyhedral conic functions to carry out the classification task and therefore requires very little memory. All this makes the classifier highly efficient for real-time classification.