
MDL-based time series clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

Time series data are pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. Given this, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series that have been carefully extracted from their original context, for example, gene expression profiles, individual heartbeats, or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases have been shown to work only on the most contrived synthetic datasets, and only after careful adjustment of a large set of parameters. In this work, we make two fundamental contributions that allow, for the first time, the meaningful clustering of subsequences from a time series stream. First, we show that the problem definition currently used for time series clustering from streams is inherently flawed, and that a new definition is necessary. Second, we show that the minimum description length framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analyses.
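To make the minimum description length idea concrete, the following is a minimal sketch under stated assumptions, not the method of the article itself. It assumes that a series is first discretized to a small integer alphabet, that the description length of a sequence is approximated by its length times its empirical entropy, and that the cost of a sequence given a hypothesis is the description length of their elementwise difference. The function names and the `cardinality` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def discretize(series, cardinality=64):
    """Map real values onto integers in [0, cardinality - 1] by min-max scaling.

    The cardinality of 64 is an illustrative assumption, not a value from the article.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    return [round((x - lo) / (hi - lo) * (cardinality - 1)) for x in series]


def entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def description_length(symbols):
    """Approximate bits needed to store the sequence: length times entropy."""
    return len(symbols) * entropy(symbols)


def conditional_description_length(symbols, hypothesis):
    """Approximate bits needed to store `symbols` when `hypothesis` is already known.

    Only the elementwise difference is encoded; both sequences are assumed
    to have the same length.
    """
    residual = [a - b for a, b in zip(symbols, hypothesis)]
    return description_length(residual)


if __name__ == "__main__":
    center = [0, 1, 2, 3, 4, 5, 6, 7]   # hypothetical cluster center
    member = [0, 1, 2, 3, 5, 5, 6, 7]   # nearly identical subsequence
    print("stored alone:       ", description_length(member))
    print("stored given center:", conditional_description_length(member, center))
```

In this sketch, a subsequence that closely resembles an existing cluster center is cheaper to store as a difference from that center than on its own. That saving in bits is the kind of signal a description-length criterion can use to decide whether a subsequence belongs in a cluster, without a user-supplied distance threshold.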


Author information

Corresponding author

Correspondence to Thanawin Rakthanmanon.

About this article

Cite this article

Rakthanmanon, T., Keogh, E.J., Lonardi, S. et al. MDL-based time series clustering. Knowl Inf Syst 33, 371–399 (2012). https://doi.org/10.1007/s10115-012-0508-7

  • DOI: https://doi.org/10.1007/s10115-012-0508-7
