Elsevier

Information Sciences

Volume 328, 20 January 2016, Pages 42-59
On the stopping criteria for k-Nearest Neighbor in positive unlabeled time series classification problems

https://doi.org/10.1016/j.ins.2015.07.061

Abstract

Positive unlabeled time series classification has become an important area during the last decade, as often vast amounts of unlabeled time series data are available but obtaining the corresponding labels is difficult. In this situation, positive unlabeled learning is a suitable option to mitigate the lack of labeled examples. In particular, self-training is a widely used technique due to its simplicity and adaptability. Within this technique, the stopping criterion, i.e., the decision of when to stop labeling, is a critical part, especially in the positive unlabeled context. We propose a self-training method that follows the positive unlabeled approach for time series classification and a family of parameter-free stopping criteria for this method. Our proposal uses a graphical analysis, applied to the minimum distances obtained by the k-Nearest Neighbor as the base learner, to estimate the class boundary. The proposed method is evaluated in an experimental study involving various time series classification datasets. The results show that our method outperforms the transductive results obtained by previous models.

Introduction

In this paper we tackle positive unlabeled (PU) learning [12] problems in the time series classification context. PU learning is a suitable approach when we have only a limited number of examples from a given class (the positive class) and a large amount of unlabeled data. In the time series field, unlabeled data are often easy to obtain [6], [19], [32], but obtaining the labels may require considerable time and attention from a skilled domain expert. In such a situation, PU learning becomes a proper solution for binary classification problems.

PU learning takes advantage of both labeled and unlabeled examples. It can be considered a particular case of semi-supervised learning (SSL) [9], [38] in which labeled examples are available from only one class. Thus, PU learning becomes an extension of one-class classification [36] to the semi-supervised context. Despite the close relation between PU and SSL, classical methods of the latter approach cannot be directly applied to PU learning [7].

In the specialized literature, different PU learning approaches have been proposed for time series classification. Two main approaches are used. The first is based on clustering techniques [30], [31]. The second builds on traditional supervised learning, which is the most widely used approach in time series classification [3], [10], [33], [42]. Most of these solutions adapt the self-training technique [44], initially proposed for standard SSL. It is a wrapper method that iteratively classifies the unlabeled examples, assuming that its most confident predictions tend to be correct. As the base learner, it is common to use the k-Nearest Neighbor (k-NN) classifier [1], because it has been shown to be particularly effective for time series classification tasks [35]. As dissimilarity measures for time series in the self-training context, we can find the Euclidean distance, Dynamic Time Warping (DTW) [34], and DTW-D [10].
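The DTW measure mentioned above can be sketched with its classic dynamic-programming recurrence. The following is a minimal, unoptimized illustration; practical implementations typically add a warping-window constraint (e.g., the Sakoe–Chiba band, the parameter varied in our experiments) and lower-bounding for speed:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series,
    using squared pointwise differences and a full warping window."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))
```

Unlike the Euclidean distance, DTW aligns series of different lengths and tolerates local time shifts, which is why it dominates in 1-NN time series classification.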

Due to its iterative nature, a critical part of the self-training technique is deciding when the learning should be stopped. The aim of this mechanism is to avoid enlarging the positive set with unlabeled instances that have a low confidence level. This is known in the literature as the stopping criterion problem. Compared to the standard SSL context, this issue is even more important here, because the presence of only positive labeled examples facilitates the erroneous inclusion of negative examples during the iterative process.
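The generic self-training loop for PU learning with a 1-NN base learner can be sketched as follows. This is a minimal illustration rather than any of the cited authors' implementations; the dissimilarity function is pluggable (e.g., DTW), and the sequence of minimal distances it records is exactly the signal that the stopping criteria below inspect:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def self_train_pu(positives, unlabeled, dist=euclidean, max_iter=None):
    """Self-training for PU learning with 1-NN: repeatedly move the
    unlabeled series closest to the positive set into it, recording the
    minimal distance at each iteration (used later by stopping criteria)."""
    P = list(positives)
    U = list(unlabeled)
    min_dists = []
    steps = len(U) if max_iter is None else min(max_iter, len(U))
    for _ in range(steps):
        # distance from each unlabeled series to its nearest positive
        d = [min(dist(u, p) for p in P) for u in U]
        i = int(np.argmin(d))
        min_dists.append(d[i])
        P.append(U.pop(i))  # label the most confident example as positive
    return P, U, min_dists
```

Without a stopping criterion, the loop would eventually absorb negative examples; the recorded `min_dists` typically jump when the labeling crosses the class boundary.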

The pioneering work of Wei and Keogh [42] (hereafter denoted WK) proposes a simple heuristic to stop the iterative learning based on the decrease of the minimal distance in the k-NN method. Frequently, this solution causes incomplete learning of the trained classifier. An improvement of this work is addressed in Ratanamahatana and Wanichsan [33] (hereafter denoted RW). This criterion exploits significant changes in the minimal k-NN distance between the current and previous iterations. A recent contribution [3], [37] to the stopping criterion problem tries to learn the intrinsic structure of the data using the Minimum Description Length (MDL) principle [20]. The MDL concept has been widely applied in other domains, and the work of Begum et al. [3] (hereafter denoted BHRK) uses it to develop a novel stopping criterion. The stopping criteria proposed for the self-training technique are analyzed in more detail in Section 2.
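As a rough illustration of the distance-change idea behind such criteria (this is a toy stand-in, not the exact WK, RW, or BHRK formulation), one can stop at the iteration where the minimal 1-NN distance increases most sharply over its predecessor:

```python
def stop_at_largest_jump(min_dists):
    """Return the number of iterations to keep: stop just before the
    largest increase in the minimal 1-NN distance, interpreted as the
    moment the labeling crosses the class boundary."""
    if len(min_dists) < 2:
        return len(min_dists)
    diffs = [b - a for a, b in zip(min_dists, min_dists[1:])]
    # index of the largest jump; keep everything labeled before it
    return max(range(len(diffs)), key=diffs.__getitem__) + 1
```

Such point heuristics are sensitive to noise in the distance sequence, which motivates the more global graphical analysis proposed in this paper.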

In this paper we delve into the stopping criterion problem in the self-training context. To do so, we base our proposal on the k-NN method and the DTW time series measure. We define different stopping criteria based on the minimum distances achieved by the k-NN in each iteration. The principal novelty of this work is the use of a specialized graphical analysis technique [13] to identify the boundary between classes. The resulting procedure is parameter-free and requires no additional computational effort to obtain the stopping point.

To evaluate the performance of our proposal, we conduct experiments involving various UCR [23] classification datasets, starting the learning with only one positive labeled instance. We test the stopping criteria with different parameters of the measure selected to compute the dissimilarity between time series. The experimental study includes a statistical analysis based on non-parametric statistical tests [18]. Finally, we select the most competitive stopping criterion proposed.

The rest of this paper is organized as follows. In Section 2, we provide definitions and the notation of PU learning. In Section 3, we define the stopping criteria proposed and some theoretical considerations. In Section 4, we include the experiment design to test the criteria proposed. In Section 5, we present the results obtained in the experiments. In Section 6, we offer some conclusions.

Background and preliminaries

In this section we define the PU learning problem and the principal proposals in this topic (Section 2.1). Then, we review the self-training technique for PU time series classification (Section 2.2) and address the existing stopping criteria for this technique (Section 2.3).

CBD-GA: Class Boundary Description by Graphic Analysis

In this section we present a new method to identify the class boundary, denoted Class Boundary Description using Graphical Analysis (CBD-GA). To this end, we apply concepts from graphical analysis techniques to identify the boundary in the tuple Mindist by means of three curves (Section 3.1). We define a family of heuristic stopping criteria (Section 3.2) using parameters calculated from these curves. Finally, we describe the performance measures used to evaluate the proposed stopping criteria (Section 3.3).
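Since this snippet ends before the details of CBD-GA are given, the following is only a generic illustration of one common graphical-analysis device for locating a boundary in a minimal-distance curve: the "knee" point farthest, perpendicularly, from the chord joining the curve's endpoints. It is a hypothetical stand-in, not the authors' three-curve procedure:

```python
import numpy as np

def knee_point(values):
    """Index of the 'knee' of a curve: the point with maximum
    perpendicular distance to the straight line joining the
    first and last points of the curve."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y), dtype=float)
    # chord from the first to the last point
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    norm = np.hypot(dx, dy)
    # perpendicular distance of every point to the chord
    dists = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / norm
    return int(np.argmax(dists))
```

The appeal of such geometric constructions is that they involve no tunable parameters, consistent with the parameter-free goal stated above.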

Experimental framework

In this section, we provide a brief description of the time series datasets and present the experimental design used to test the proposed criteria. As already stated in Section 2.2, the performance of self-training depends on the fulfillment of its principal model assumption, namely that the classes form well-separated clusters. Consequently, as a proxy for this, we evaluate the proximity of the positive instances to the opposite class by computing the leave-one-out estimate of the 1-NN error
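The leave-one-out 1-NN error mentioned above can be computed directly. A minimal sketch under the Euclidean distance follows (the exact setup used in the experiments may differ, e.g., in the dissimilarity measure):

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out error of the 1-NN classifier under Euclidean
    distance: each instance is classified by its nearest neighbor
    among the remaining instances."""
    X = np.asarray(X, dtype=float)
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                 # exclude the instance itself
        nn = int(np.argmin(d))
        errors += y[nn] != y[i]
    return errors / len(X)
```

A low leave-one-out 1-NN error indicates well-separated classes, i.e., a dataset on which the self-training cluster assumption is plausible.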

Results and discussion

In this section, we present the results obtained in the experiments, analyze them with a statistical testing procedure, and discuss the findings. We consider two stages during the statistical analysis. The first stage is the comparison among the five proposed criteria to determine the most competitive one (Section 5.1). The second stage is the comparison between our most competitive stopping criterion and the state-of-the-art methods based on the self-training technique: WK, RW, and BHRK (Section 5.2).

Conclusions

We explored the use of graphical analysis techniques to define the class boundary more accurately in the PU self-training method for time series classification. To this end, we defined a family of stopping criteria based on the graphical analysis of the distances achieved by the 1-NN as the base learner. In particular, we described three graphic patterns related to the class boundary and used them in the proposed criteria to identify the stopping point. The experimental results allowed us to

Acknowledgments

This work was supported in part by “Proyecto de Investigación de Excelencia de la Junta de Andalucía, P12-TIC-2958” and “Proyecto de Investigación del Ministerio de Economía y Competitividad, TIN2013-47210-P”. This work was partly performed while González held a travel grant from the “Asociación Universitaria Iberoamericana de Postgrado” (AUIP), supported by “Junta de Andalucía”, to undertake a research stay at the University of Granada. I. Triguero holds a BOF postdoctoral fellowship from Ghent University.

Mabel González received the M.Sc. degree in Computer Science from the Universidad Central “Marta Abreu” de las Villas, Cuba, in 2010. She is currently a Ph.D student in the University of Granada, Granada, Spain. Her research interests include data mining, semi-supervised learning and time series classification.

References (47)

  • G. Batista et al.

    Towards automatic classification on flying insects using inexpensive sensors

    Proceedings of the 10th International Conference on Machine Learning and Applications (ICMLA)

    (2011)
  • N. Begum et al.

    A minimum description length technique for semi-supervised time series classification

    Integration of Reusable Systems. Advances in Intelligent Systems and Computing

    (2014)
  • B. Bergmann et al.

    Improvements of general multiple test procedures for redundant systems of hypotheses

    Multiple Hypotheses Testing. Medizinische Informatik und Statistik

    (1988)
  • A. Blum et al.

    Combining labeled and unlabeled data with co-training

    Proceedings of the 11th Annual Conference on Computational Learning Theory

    (1998)
  • G. Bruno et al.

    Temporal pattern mining for medical applications

    Intell. Syst. Ref. Libr.

    (2012)
  • O. Chapelle et al.

    Semi-Supervised Learning

    (2006)
  • Y. Chen et al.

    DTW-D: time series semi-supervised learning from a single example

    Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13

    (2013)
  • F. Denis et al.

    Text classification and co-training from positive and unlabeled examples

    Proceedings of the ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data

    (2003)
  • F. Denis et al.

    Text classification from positive and unlabeled examples

    Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU ’02

    (2002)
  • R.D. Edwards et al.

    Technical Analysis of Stock Trends

    (2012)
  • M. Friedman

    The use of ranks to avoid the assumption of normality implicit in the analysis of variance

    J. Am. Stat. Assoc.

    (1937)
  • S. García et al.

    An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons

    J. Mach. Learn. Res.

    (2008)
  • A.L. Goldberger et al.

    PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals

    Circulation

    (2000)
    Christoph Bergmeir received the M.Sc. degree in Computer Science from the University of Ulm, Germany, in 2008, and the Ph.D. degree from the University of Granada, Spain, in 2013. He is currently working at Faculty of Information Technology, Monash University, Melbourne, Australia. He has published in journals such as IEEE Transactions on Neural Networks and Learning Systems, Journal of Statistical Software, Computer Methods and Programs in Biomedicine, and Information Sciences.

    Isaac Triguero received the M.Sc. and Ph.D. degree in Computer Science from the University of Granada, Granada, Spain, in 2009 and 2014, respectively. He is currently post-doctoral researcher at the Inflammation Research Center of the Ghent University, Ghent, Belgium. He has published 19 international journal papers as well as more than 16 contributions to conferences. His research interests include data mining, data reduction, biometrics, evolutionary algorithms, semi-supervised learning and big data learning.

    Yanet Rodríguez received the M.Sc. and Ph.D. degree in Computer Science from the Universidad Central “Marta Abreu” de las Villas, Cuba, in 2000 and 2007, respectively. She has published in journals such as International Journal of Hybrid Intelligent Systems as well as more than 12 contributions to conferences. Her research interests include data mining, fuzzy systems, neural networks, data stream and time series classification.

    José Manuel Benítez (M’98) received the M.Sc. and PhD. degree in Computer Science both from the Universidad de Granada, Spain. He is currently an Associate Professor at the Department of Computer Science and Artificial Intelligence, Universidad de Granada. He is the head of the Distributed Computational Intelligence and Time Series (DiCITS) lab. His research interests include cloud computing and Big Data, Data Science, computational intelligence and time series. He has published in the leading journals of the “artificial intelligence” and Computer Science field. He has led a number of research projects funded by different international and national organizations as well as research contracts with leading international corporations.
