Shape-based retrieval in time-series databases

https://doi.org/10.1016/j.jss.2005.05.004

Abstract

Shape-based retrieval is the operation of searching for the (sub)sequences whose shapes are similar to that of a query sequence, regardless of their actual element values. In this paper, we propose a similarity model suitable for shape-based retrieval and present an indexing method that supports it. The proposed similarity model retrieves similar shapes accurately by combining multiple shape-preserving transformations such as normalization, moving average, and time warping. Our indexing method stores every distinct subsequence concisely in a disk-based suffix tree for efficient and adaptive query processing. We allow the user to dynamically choose a similarity model suitable for a given application. More specifically, the user determines the parameter p of the distance function Lp when submitting a query. Extensive experiments show that our approach not only finds the subsequences whose shapes are similar to a query shape but also significantly outperforms the sequential scan method.

Introduction

A time-series database is a set of data sequences (hereafter simply called sequences), each of which is an ordered list of elements (Agrawal et al., 1993). Sequences of stock prices, money exchange rates, temperature data, product sales data, and company growth rates are typical examples of time-series data (Agrawal et al., 1995a, Faloutsos et al., 1994). Similarity search is an operation that finds sequences or subsequences whose changing patterns are similar to that of a given query sequence (Agrawal et al., 1993, Agrawal et al., 1995a, Faloutsos et al., 1994). Similarity search is of growing importance in many new applications such as data mining and data warehousing (Chen et al., 1996, Rafiei and Mendelzon, 1997).

To measure the similarity of two sequences of length n, most approaches (Agrawal et al., 1993, Chu and Wong, 1999, Faloutsos et al., 1994, Goldin and Kanellakis, 1995, Rafiei and Mendelzon, 1997, Rafiei, 1999) map the sequences into points in n-dimensional space and use the Euclidean distance between those points as the similarity measure. However, these approaches often miss data sequences that are actually similar to the query sequence from the user's perspective. Therefore, recent work on similarity search tends to support various types of transformations such as scaling (Agrawal et al., 1995a, Chu and Wong, 1999), shifting (Agrawal et al., 1995a, Chu and Wong, 1999), normalization (Agrawal et al., 1995a, Chu and Wong, 1999, Das et al., 1997, Goldin and Kanellakis, 1995, Loh et al., 2001), moving average (Loh et al., 2000, Rafiei and Mendelzon, 1997, Rafiei, 1999), and time warping (Berndt and Clifford, 1996, Kim et al., 2001, Park et al., 2000, Park et al., 2001, Yi et al., 1998).

This paper addresses the problem of shape-based retrieval that finds the sequences whose shapes are similar to that of a given query sequence regardless of their actual element values. To provide a flexible solution to this problem, this paper introduces a new similarity model that employs combinations of multiple transformations such as shifting, scaling, moving average, and time warping.
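The transformations named above can be illustrated with a minimal Python sketch. It is not the paper's formulation: the function names are hypothetical, shifting and scaling are assumed to be combined as zero-mean, unit-variance normalization, and the time warping distance is the textbook dynamic-programming recurrence with |x - y| as the per-element cost.

```python
from typing import List, Sequence


def normalize(seq: Sequence[float]) -> List[float]:
    """Shift and scale a sequence to zero mean and unit standard deviation.

    Assumption: this is one common way to remove offset and amplitude;
    the paper's exact normalization is not given in this excerpt.
    """
    n = len(seq)
    mean = sum(seq) / n
    std = (sum((x - mean) ** 2 for x in seq) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # a flat sequence has no amplitude to rescale
    return [(x - mean) / std for x in seq]


def moving_average(seq: Sequence[float], k: int) -> List[float]:
    """k-point moving average; the result has len(seq) - k + 1 elements."""
    if not 1 <= k <= len(seq):
        raise ValueError("window size k must be between 1 and len(seq)")
    return [sum(seq[i:i + k]) / k for i in range(len(seq) - k + 1)]


def time_warping_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook time warping distance, summing |x - y| along the best path."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row for the empty prefix of a
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend the cheapest of: repeat y, repeat x, or advance both.
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    low = [10, 12, 14, 12, 10]
    high = [100, 120, 140, 120, 100]  # same shape, different values
    print(normalize(low))
    print(normalize(high))
    print(moving_average([1, 2, 3, 4, 5], 3))
    print(time_warping_distance([1, 2, 3], [1, 2, 2, 3]))
```

In this sketch, the two sequences `low` and `high` differ by a factor of ten but normalize to the same values up to floating-point rounding, which is the sense in which shape is compared independently of the actual element values.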

In particular, our similarity model supports multiple Lp distance functions for measuring the similarity between the two finally transformed sequences. If a user chooses one among the Manhattan distance L1, the Euclidean distance L2, and the maximum distance L∞, the proposed method performs the shape-based retrieval by using the chosen distance function. This flexibility is useful because users may hold different notions of similarity depending on the application. In addition, an important feature of the proposed method is that it supports all three distance functions with a single index built in advance.
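For reference, the three distance functions can be written as one parameterized function. The sketch below uses the standard definition of the Lp distance between equal-length sequences; the function name `lp_distance` and the use of `math.inf` to select the maximum distance are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def lp_distance(a: Sequence[float], b: Sequence[float], p: float) -> float:
    """Standard Lp distance between two equal-length sequences.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = math.inf the maximum distance.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(d ** p for d in diffs) ** (1.0 / p)


if __name__ == "__main__":
    s, q = [0, 0], [3, 4]
    for p in (1, 2, math.inf):
        print(p, lp_distance(s, q, p))
```

Because the three functions weigh element-wise differences differently (the sum, the root of the summed squares, and the single largest difference), the same pair of sequences can fall inside a tolerance under one choice of p and outside it under another.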

Similarity search is classified into whole matching and subsequence matching (Agrawal et al., 1993).

  • Whole matching: Given N data sequences S1, …, SN, a query sequence Q, and a tolerance ε, we find such data sequences Si that are similar to Q. Here, we note that the data and query sequences should be of the same length.

  • Subsequence matching: Given N data sequences S1, …, SN of varying lengths, a query sequence Q, and the tolerance ε, we find all the sequences Si, one or more subsequences of which are similar to Q, and the offsets in Si of those subsequences. Here, the data and query sequences are allowed to be of arbitrary lengths.

Since subsequence matching is a generalization of whole matching, it applies to a wider range of practical applications than whole matching does.
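The subsequence matching definition above can be made concrete with a naive sequential scan. This is only a baseline sketch under stated assumptions: it compares raw windows with the Euclidean distance, applies no transformations, considers only windows of the same length as the query (so it does not cover time warping), and reports 0-based offsets. The names `euclidean` and `subsequence_scan` are hypothetical.

```python
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def subsequence_scan(
    data: Sequence[Sequence[float]], query: Sequence[float], eps: float
) -> List[Tuple[int, int]]:
    """Return (sequence index, offset) for every window within eps of query."""
    m = len(query)
    matches = []
    for i, seq in enumerate(data):
        # Slide a window of the query's length over each data sequence.
        for offset in range(len(seq) - m + 1):
            if euclidean(seq[offset:offset + m], query) <= eps:
                matches.append((i, offset))
    return matches


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5], [5, 5, 2, 3, 4]]
    print(subsequence_scan(data, [2, 3, 4], eps=0.5))
```

When every data sequence has the same length as the query, each sequence contributes exactly one window at offset 0, so the same scan performs whole matching.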

In this paper, we propose a novel method for processing shape-based subsequence retrieval. We first define an effective similarity model for shape-based subsequence retrieval and then present the indexing and query processing methods that support this model. To verify the superiority of the approach, we perform extensive experiments on a variety of data sets. The results show that our approach finds all the subsequences whose shapes are similar to that of the query sequence and also achieves high search performance.

This paper is organized as follows. Section 2 briefly reviews previous work related to similarity search. Section 3 defines the notation and terminology used in this paper and introduces our similarity model. Section 4 presents the indexing method for supporting the proposed similarity model, and Section 5 describes our query processing method. Section 6 presents the experimental results to show the superiority of our method, and finally, Section 7 summarizes and concludes the paper.

Related work

In this section, we briefly survey previous research results associated with similarity search in time-series databases.

Agrawal et al. (1993) proposed a method for whole matching in time-series databases. First, each data sequence of length n is transformed into a point in f (≪n) dimensional space by using the discrete Fourier transform (DFT). For indexing a large number of such points, an R-tree (Beckmann et al., 1990) is used. For whole matching, a query sequence of length l is also

Problem definition

This section formally defines the problem we are going to solve. Section 3.1 defines the notation and terminology used in this paper. Section 3.2 describes our similarity model.

Indexing

This section describes the indexing method for efficient shape-based retrieval. Section 4.1 briefly reviews the suffix tree, the underlying index structure in our method. Section 4.2 discusses the indexing strategy for utilizing the suffix tree. Section 4.3 describes the index construction steps in detail, and Section 4.4 presents the technique for index compression.

Query processing

This section presents the query processing method for shape-based retrieval of similar subsequences and analyzes its computational complexity.

Performance evaluation

This section presents the experimental results for performance evaluation of the proposed method. Section 6.1 describes the environment for experiments and Section 6.2 shows and analyzes experimental results.

Conclusions

This paper discussed the problem of shape-based retrieval in time-series databases. This paper defined a new similarity model for shape-based subsequence retrieval, and also proposed the indexing and query processing methods for supporting this similarity model efficiently.

The proposed similarity model supports a combination of transformations such as shifting, scaling, moving average, and time warping, and allows users to choose an Lp distance function for computing the similarity between the

Acknowledgments

This work has been supported by Korea Research Foundation with Grant KRF-2003-041-D00486, the IT Research Center via Kangwon National University, and the University Research Program (C1-2002-146-0-3) of IITA. Sang-Wook Kim would like to thank Jung-Hee Seo, Suk-Yeon Hwang, Grace (Joo-Young) Kim, and Joo-Sung Kim for their encouragement and support.

References (28)

  • Agrawal, R., Faloutsos, C., Swami, A., 1993. Efficient similarity search in sequence databases. In: Proc. FODO. pp....
  • Agrawal, R., Lin, K., Sawhney, H.S., Shim, K., 1995. Fast similarity search in the presence of noise, scaling, and...
  • Agrawal, R., Psaila, G., Wimmers, E.L., Zaït, M., 1995. Querying shapes of histories. In: Proc. VLDB. pp....
  • Beckmann, N., Kriegel, H., Schneider, R., Seeger, B., 1990. The R∗-tree: an efficient and robust access method for...
  • Berndt, D.J., et al. Finding patterns in time series: a dynamic programming approach.
  • Chatfield, C., 1984. The Analysis of Time-series: an Introduction.
  • Chen, M.S., et al., 1996. Data mining: an overview from database perspective. IEEE TKDE.
  • Chu, K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: Proc. ACM PODS. pp....
  • Das, G., Gunopulos, D., Mannila, H., 1997. Finding similar time series. In: Proc. PKDD, pp....
  • Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching in time-series databases. In: Proc....
  • Goldin, D.Q., Kanellakis, P.C., 1995. On similarity queries for time-series data: constraint specification and...
  • Kendall, M., 1979. Time-series.
  • Kim, S.W., Park, S., Chu, W.W., 2001. An index-based approach for similarity search supporting time warping in large...
  • Loh, W.K., Kim, S.W., Whang, K.Y., 2000. Index interpolation: an approach for subsequence matching supporting...