Characteristic-Based Clustering for Time Series Data

Wang, Xiaozhe; Smith, Kate; Hyndman, Rob

doi:10.1007/s10618-005-0039-x

Published: 16 May 2006

Volume 13, pages 335–364, (2006)

Data Mining and Knowledge Discovery

Abstract

With the growing importance of time series clustering research, particularly for similarity searches amongst long time series such as those arising in medicine or finance, it is critical to resolve the outstanding problems that make most clustering methods impractical under certain circumstances. When a time series is very long, some clustering algorithms may fail because the very notion of similarity is dubious in high-dimensional space; in addition, many methods cannot handle missing data when the clustering is based on a distance metric.

This paper proposes a method for clustering time series based on their structural characteristics. Unlike other alternatives, this method does not cluster point values using a distance metric; rather, it clusters based on global features extracted from the time series. The feature measures are obtained from each individual series and can be fed into arbitrary clustering algorithms, including an unsupervised neural network algorithm (the self-organizing map) or a hierarchical clustering algorithm.
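As a rough illustration of the second step, the sketch below clusters a handful of feature vectors (one row per series) with a plain average-linkage agglomerative procedure. It is only a minimal sketch under stated assumptions: the function names, the two-column feature layout, the Euclidean distance between feature vectors, and the toy values are all hypothetical, and it is not the clustering configuration used in the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(features, n_clusters):
    """Average-linkage agglomerative clustering on feature vectors.

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    if not 1 <= n_clusters <= len(features):
        raise ValueError("n_clusters must be between 1 and len(features)")
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(features[p], features[q])
                            for p in clusters[i] for q in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge the closest pair and drop the absorbed cluster.
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)

# Hypothetical feature vectors: each row summarises one series,
# e.g. [trend measure, seasonality measure], already scaled to [0, 1].
features = [
    [0.90, 0.10],
    [0.85, 0.15],
    [0.10, 0.80],
    [0.20, 0.90],
]
clusters = agglomerative(features, 2)
print(clusters)
```

Because the clustering step only sees fixed-length feature vectors, series of different lengths can be compared directly, and any other clustering algorithm could be substituted for this one.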

Global measures describing the time series are obtained by applying statistical operations that best capture the underlying characteristics: trend, seasonality, periodicity, serial correlation, skewness, kurtosis, chaos, nonlinearity, and self-similarity. Since the method clusters using extracted global measures, it reduces the dimensionality of the time series and is much less sensitive to missing or noisy data. We further provide a search mechanism to find the best selection from the feature set that should be used as the clustering inputs.
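To make the idea of global measures concrete, the following sketch reduces one series to four numbers: a trend strength, a lag-1 serial correlation, skewness, and excess kurtosis. The function name `global_features`, the use of `None` for missing values, and the specific estimators (for instance, the R² of a straight-line fit as the trend measure) are assumptions chosen for brevity; they are simplified stand-ins and not the statistical operations defined in the paper.

```python
import math

def global_features(series):
    """Summarise one series by a few global measures.

    `series` is a list of numbers; None marks a missing observation.
    """
    obs = [(t, x) for t, x in enumerate(series) if x is not None]
    n = len(obs)
    if n < 3:
        raise ValueError("need at least three observed values")
    ts = [t for t, _ in obs]
    xs = [x for _, x in obs]
    mean_t = sum(ts) / n
    mean_x = sum(xs) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("series is constant")
    sd = math.sqrt(var)

    # Shape of the value distribution.
    skewness = sum(((x - mean_x) / sd) ** 3 for x in xs) / n
    kurtosis = sum(((x - mean_x) / sd) ** 4 for x in xs) / n - 3.0

    # Lag-1 serial correlation, using only adjacent pairs that are both observed.
    pairs = [(series[t], series[t + 1]) for t in range(len(series) - 1)
             if series[t] is not None and series[t + 1] is not None]
    acf1 = sum((a - mean_x) * (b - mean_x) for a, b in pairs) / (n * var)

    # Trend strength: R^2 of a least-squares straight line fitted against time.
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (x - mean_x) for t, x in obs)
    trend = sxy ** 2 / (sxx * n * var)

    return {"trend": trend, "acf1": acf1,
            "skewness": skewness, "kurtosis": kurtosis}

# A series with one missing value is still reduced to the same four numbers.
features = global_features([1, 2, 3, 4, 5, None, 7, 8])
print(features)
```

Every series, whatever its length and however many values are missing, maps to a vector of the same small dimension, which is what allows the measures to be passed to a standard clustering algorithm.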

The proposed technique has been tested using benchmark time series datasets previously reported for time series clustering and a set of time series datasets with known characteristics. The empirical results show that our approach is able to yield meaningful clusters. The resulting clusters are similar to those produced by other methods, but with some promising and interesting variations that can be intuitively explained with knowledge of the global characteristics of the time series.

Acknowledgment

The first author is grateful to Monash University for providing a Postgraduate Publication Award to support the manuscript preparation.

Author information

Authors and Affiliations

Faculty of Information Technology, Monash University, Clayton, Victoria, 3800, Australia
Xiaozhe Wang & Kate Smith
Department of Econometrics and Business Statistics, Monash University, Clayton, Victoria, 3800, Australia
Rob Hyndman

Corresponding author

Correspondence to Xiaozhe Wang.

About this article

Cite this article

Wang, X., Smith, K. & Hyndman, R. Characteristic-Based Clustering for Time Series Data. Data Min Knowl Disc 13, 335–364 (2006). https://doi.org/10.1007/s10618-005-0039-x

Received: 12 July 2005
Accepted: 12 December 2005
Published: 16 May 2006
Issue Date: November 2006
DOI: https://doi.org/10.1007/s10618-005-0039-x
