Improving SVM classification on imbalanced time series data sets with ghost points

  • Regular Paper
Knowledge and Information Systems

Abstract

Imbalanced data sets present a particular challenge to the data mining community. Often, it is the rare event that is of interest, and the cost of misclassifying the rare event is higher than that of misclassifying the usual event. When the data is highly skewed toward the usual event, it can be very difficult for a learning system to accurately detect the rare event. There have been many approaches in recent years for handling imbalanced data sets, from under-sampling the majority class to adding synthetic points to the minority class in feature space. However, distances between time series are known to be non-Euclidean and non-metric, since comparing time series requires warping in time. This fact makes it impossible to apply standard methods like SMOTE to insert synthetic data points in feature spaces. We present an innovative approach that augments the minority class by adding synthetic points in distance spaces. We then use Support Vector Machines (SVMs) for classification. Our experimental results on standard time series data sets show that our synthetic points significantly improve the classification rate of the rare events and, in most cases, also improve the overall accuracy of SVMs. We also show how adding our synthetic points can aid in the visualization of time series data sets.
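
The statement that warping-based distances are non-metric can be checked on a toy case. The sketch below is not the authors' code: it assumes the textbook dynamic time warping (DTW) recurrence with the absolute difference as local cost, and the function name `dtw_distance` and the three short series are hypothetical, chosen only for illustration. It shows one triple of series for which the triangle inequality fails, which is the property that prevents treating such distances as distances in a vector feature space.

```python
def dtw_distance(a, b):
    """Textbook DTW cost between two numeric sequences.

    Assumption: local cost is |a[i] - b[j]| and no warping-window
    constraint is applied. This is an illustrative sketch only.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat b[j-1] against a new a element
                cost[i][j - 1],      # repeat a[i-1] against a new b element
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# Hypothetical toy series of different lengths.
a = [0, 0, 0]
b = [0, 1]
c = [1, 1, 1]

d_ab = dtw_distance(a, b)  # 1.0
d_bc = dtw_distance(b, c)  # 1.0
d_ac = dtw_distance(a, c)  # 3.0

# Triangle inequality would require d_ac <= d_ab + d_bc; here 3.0 > 2.0.
print(d_ab, d_bc, d_ac, d_ac <= d_ab + d_bc)
```

Because the short series `b` can be stretched cheaply to match either `a` or `c`, it sits "close" to both, while `a` and `c` remain far apart. Coordinate-wise interpolation between neighbors, as SMOTE does in feature space, has no direct counterpart under such a distance.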

Author information

Correspondence to Suzan Köknar-Tezel.

About this article

Cite this article

Köknar-Tezel, S., Latecki, L.J. Improving SVM classification on imbalanced time series data sets with ghost points. Knowl Inf Syst 28, 1–23 (2011). https://doi.org/10.1007/s10115-010-0310-3

  • DOI: https://doi.org/10.1007/s10115-010-0310-3
