MDL-based time series clustering

Rakthanmanon, Thanawin; Keogh, Eamonn J.; Lonardi, Stefano; Evans, Scott

doi:10.1007/s10115-012-0508-7

Regular Paper
Published: 12 June 2012

Volume 33, pages 371–399, (2012)

Knowledge and Information Systems

Abstract

Time series data are pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. Given this, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series that have been carefully extracted from their original context, for example, gene expression profiles, individual heartbeats, or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived synthetic datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions that allow, for the first time, the meaningful clustering of subsequences from a time series stream. First, we show that the problem definition currently used for time series clustering from streams is inherently flawed, and a new definition is necessary. Second, we show that the minimum description length framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analyses.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92521, USA
Thanawin Rakthanmanon, Eamonn J. Keogh & Stefano Lonardi
GE Global Research, Niskayuna, NY, USA
Scott Evans

Corresponding author

Correspondence to Thanawin Rakthanmanon.

About this article

Cite this article

Rakthanmanon, T., Keogh, E.J., Lonardi, S. et al. MDL-based time series clustering. Knowl Inf Syst 33, 371–399 (2012). https://doi.org/10.1007/s10115-012-0508-7

Received: 11 December 2011
Revised: 23 March 2012
Accepted: 07 April 2012
Published: 12 June 2012
Issue Date: November 2012
DOI: https://doi.org/10.1007/s10115-012-0508-7
