Abstract
Time series data are pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. Given this, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series that have been carefully extracted from their original context, for example, gene expression profiles, individual heartbeats, or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases have been shown to work only on the most contrived synthetic datasets, and only after careful adjustment of a large set of parameters. In this work, we make two fundamental contributions that allow, for the first time, the meaningful clustering of subsequences from a time series stream. First, we show that the problem definition currently used for time series clustering from streams is inherently flawed, and that a new definition is necessary. Second, we show that the minimum description length framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analysis.
Cite this article
Rakthanmanon, T., Keogh, E.J., Lonardi, S. et al. MDL-based time series clustering. Knowl Inf Syst 33, 371–399 (2012). https://doi.org/10.1007/s10115-012-0508-7
DOI: https://doi.org/10.1007/s10115-012-0508-7