Elsevier

Information Systems

Volume 28, Issue 7, October 2003, Pages 867-883

Similarity search of time-warped subsequences via a suffix tree

https://doi.org/10.1016/S0306-4379(02)00102-3

Abstract

This paper proposes an indexing technique for fast retrieval of similar subsequences using the time-warping distance. The time-warping distance is a more suitable similarity measure than the Euclidean distance in many applications where sequences may have different lengths and/or different sampling rates. The proposed indexing technique employs a disk-based suffix tree as an index structure and uses lower-bound distance functions to filter out dissimilar subsequences without false dismissals. To make the index structure compact and hence accelerate query processing, it converts sequences in the continuous domain into sequences in the discrete domain and stores only the subset of suffixes whose first values differ from those of the immediately preceding suffixes. Extensive experiments with real and synthetic data sequences revealed that the proposed approach significantly outperforms the sequential scan and LB scan approaches and scales well to large sequence databases.

Introduction

A similarity search in sequence databases is an operation that finds sequences or subsequences whose changing patterns are similar to those of a given query sequence [1], [2], [3]. Similarity search is of growing importance in many new application domains such as information retrieval, data mining and clustering. In the medical domain especially, a search for patients with similar disease evolution patterns can support patient care by giving physicians insight into the treatment of previous patients with similar medical conditions.

The sequential scan method for similarity search reads each sequence or subsequence sequentially from the database and computes its distance to a query sequence. This method is simple but suffers from severe performance degradation when the database is large. Therefore, an effective indexing scheme is essential as a scalable solution for similarity search.

Most of the previous indexing techniques [1], [3], [4] for similarity search use the Euclidean distance metric. However, in many applications, the sampling rates and/or the lengths of sequences may be different, making it difficult or impossible to use the Euclidean distance as a similarity measure. In the area of speech recognition [5], this problem has been approached using a similarity measure called the time-warping distance [5], [6], which allows sequences to be stretched or compressed along the time axis. Under time warping, any element of a sequence can be matched to one or more neighboring elements of another sequence. As an example [7], let us consider two sequences, X = 〈20,20,21,21,20,20,23,23〉 and Y = 〈20,21,20,23〉, where X contains the closing prices of a stock taken every day and Y contains the closing prices of another stock taken every other day. X and Y cannot be compared directly because X is longer than Y. The Euclidean distance between Y and any subsequence of length four of X is greater than 1.41. However, if we replicate every element of Y using time warping, we find that the two sequences are identical.
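The stock example above can be checked with a short sketch. It assumes the common dynamic-programming formulation of the time-warping distance with |a - b| as the base distance between elements; the function names are illustrative and are not taken from the paper.

```python
import math

def time_warping_distance(s, q):
    """Time-warping distance by dynamic programming.
    Assumption: the base distance between two elements is |a - b|."""
    n, m = len(s), len(q)
    inf = float("inf")
    # d[i][j] = distance between the first i elements of s and the first j of q
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            # an element may be matched to one or more neighboring elements
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def min_euclidean_over_windows(x, y):
    """Smallest Euclidean distance between y and any window of x of length len(y)."""
    w = len(y)
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x[i:i + w], y)))
        for i in range(len(x) - w + 1)
    )

X = [20, 20, 21, 21, 20, 20, 23, 23]
Y = [20, 21, 20, 23]

best_euclidean = min_euclidean_over_windows(X, Y)  # closest window is still not a match
warped = time_warping_distance(X, Y)               # every element of Y matched twice
print(best_euclidean, warped)
```

Under these assumptions the best Euclidean window is at distance sqrt(2), while the time-warping distance is zero because each element of Y aligns with two consecutive elements of X.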

It is important to prevent false dismissals [1] in similarity search. A false dismissal occurs when part of the final query result is missed. Indexing techniques that assume the triangular inequality directly or indirectly may produce false dismissals when a distance function that does not satisfy the triangular inequality is used as a similarity measure [4]. Unfortunately, the time-warping distance does not satisfy the triangular inequality, which can be shown by a simple counterexample [4]. This property makes spatial access methods based on the triangular inequality unsuitable for similarity search with the time-warping distance.
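A small numeric counterexample makes the violation concrete. The sketch below assumes the usual recursive definition of the time-warping distance with |a - b| as the base distance; the three sequences A, B and C are chosen for illustration and are not claimed to be the ones used in [4].

```python
from functools import lru_cache

def time_warping_distance(s, q):
    """Recursive time-warping distance (assumed base distance: |a - b|)."""
    s, q = tuple(s), tuple(q)
    inf = float("inf")

    @lru_cache(maxsize=None)
    def d(i, j):
        # distance between the remainders s[i:] and q[j:]
        if i == len(s) and j == len(q):
            return 0
        if i == len(s) or j == len(q):
            return inf  # one sequence is exhausted while the other is not
        return abs(s[i] - q[j]) + min(d(i, j + 1), d(i + 1, j), d(i + 1, j + 1))

    return d(0, 0)

A = (0,)
B = (1, 2)
C = (2, 3, 3)

d_ab = time_warping_distance(A, B)
d_bc = time_warping_distance(B, C)
d_ac = time_warping_distance(A, C)
# The triangular inequality would require d_ac <= d_ab + d_bc
print(d_ab, d_bc, d_ac, d_ac <= d_ab + d_bc)
```

Here the direct distance from A to C exceeds the sum of the distances through B, so an index that prunes candidates using the triangular inequality could discard a true answer.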

In the area of string matching, a suffix tree [8] has been extensively used as an index structure to find the substrings that exactly match a given query string. A suffix tree may be a good candidate for an index structure with the time-warping distance because it does not assume any geometry or any underlying distance function. However, the following problems have to be addressed before a suffix tree can be used in similarity search: (1) A suffix tree is designed for exact matching of substrings. Its search algorithm needs to be extended for similarity-based matching of subsequences. (2) A suffix tree is usually built from sequences in the discrete domain; however, the sequences we consider in this paper are from the continuous domain. A systematic method to convert continuous values into discrete values is required.
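To illustrate the exact-matching use that the paragraph above starts from, here is a minimal in-memory sketch of an uncompressed suffix trie. It is a simplification for illustration only: it is not the disk-based suffix tree proposed in the paper, it does not collapse single-child nodes, and the class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode
        self.starts = []    # start offsets of the suffixes passing through this node

def build_suffix_trie(sequence):
    """Insert every suffix of a discrete sequence (quadratic space, for illustration)."""
    root = TrieNode()
    for start in range(len(sequence)):
        node = root
        for symbol in sequence[start:]:
            node = node.children.setdefault(symbol, TrieNode())
            node.starts.append(start)
    return root

def find_exact(root, query):
    """Return the start offsets of all exact occurrences of a non-empty query."""
    node = root
    for symbol in query:
        node = node.children.get(symbol)
        if node is None:
            return []
    return node.starts

trie = build_suffix_trie("abcabd")
print(find_exact(trie, "ab"))
print(find_exact(trie, "bd"))
print(find_exact(trie, "x"))
```

Every substring is a prefix of some suffix, so walking the query from the root reaches a node exactly when the query occurs in the indexed sequence. Similarity search has to relax this walk, which is the first of the two problems listed above.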

This paper proposes a new indexing technique for the fast retrieval of similar subsequences of different lengths and/or different sampling rates. The proposed technique employs the time-warping distance as a similarity metric and a disk-based suffix tree as an index structure. To reduce the index size, it converts sequences of continuous values into sequences of discrete values and stores only a subset of suffixes whose first values are different from those of the immediately preceding suffixes. When a query sequence Q is submitted, a suffix tree is traversed from the root and the time-warping distances between Q and the subsequences contained in a suffix tree are computed. Because the subsequences contained in a suffix tree are of discrete values, their exact distances to Q cannot be obtained. Therefore, the proposed approach employs lower-bound distance functions to estimate the exact distance without false dismissals.
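The suffix-selection rule described above can be sketched in a few lines. The sketch assumes the sequence has already been converted to discrete category symbols and that the very first suffix, which has no predecessor, is always stored; the symbol names and the function name are hypothetical.

```python
def stored_suffix_starts(categorized):
    """Start offsets of the suffixes kept in the index: a suffix is kept only if
    its first value differs from the first value of the immediately preceding suffix.
    Assumption: the suffix at offset 0 is always kept."""
    return [
        i for i, symbol in enumerate(categorized)
        if i == 0 or symbol != categorized[i - 1]
    ]

# Hypothetical categorized sequence with runs of repeated symbols
categorized = ["C1", "C1", "C2", "C2", "C2", "C1", "C3", "C3"]
starts = stored_suffix_starts(categorized)
print(starts, len(starts), "of", len(categorized), "suffixes stored")
```

Long runs of the same category therefore contribute a single stored suffix, which is why coarser discretization tends to shrink the index.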

This paper is organized as follows. Section 2 provides a brief overview of the related work on sequence matching problems, and Section 3 gives the definition and properties of the time-warping distance. Section 4 introduces the index construction and query processing algorithms for a disk-based suffix tree. The ideas of categorization and a sparse suffix tree are applied to the similarity search algorithms in Sections 5 and 6, respectively. Section 7 compares the proposed algorithm with the sequential scan and LB scan algorithms.

Related work

There has been much research on similarity search in sequence databases. Agrawal et al. [1] proposed the F-Index, a similarity searching technique for whole sequence matching. Sequences are converted into the frequency domain by the Discrete Fourier Transform (DFT) and are subsequently mapped into multi-dimensional points that are managed by an R-tree [9]; this technique was extended to locate similar subsequences [3]. Since both approaches use the Euclidean distance, sequences of different

Time-warping distance

Finding a similarity measure for sequences is not easy because sequences that are qualitatively the same may be different quantitatively. First, the sequences may be of different lengths, making it difficult or impossible to embed the sequences in a metric space and use the Euclidean distance to determine similarity. Second, the sampling rates of sequences may be different: one sequence may be sampled every minute while another sequence is sampled every other minute. Such differences in rates

Similarity search using a suffix tree

This section proposes to use a suffix tree as an index structure for similarity search with the time-warping distance. Before describing the methods for constructing and traversing a suffix tree, we present the definition and internal structure of a suffix tree.

A trie is a data structure used for indexing a set of keywords. A suffix trie [8] is a trie whose set of keywords comprises the suffixes of sequences. Nodes with a single outgoing edge can be collapsed, yielding the structure known as a

Similarity search using categorization

This section introduces the concept of categorization to decrease the number of possible values that elements can take, hence increasing the length and the number of common subsequences. As explained in the previous section, the index size and the query processing time reduce as the length and the number of common subsequences increase.
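The following sketch shows one way categorization and a lower-bound distance can fit together. It rests on stated assumptions rather than on the paper's exact definitions: categories are equal-width ranges, the base distance is |a - b|, and the lower bound between a query value and a category is its distance to the nearest edge of the category's range (zero if the value lies inside). All names are hypothetical.

```python
def make_categories(lo, hi, k):
    """Split [lo, hi] into k equal-width ranges (assumed categorization scheme)."""
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def categorize(value, categories):
    """Index of the first category whose range contains the value."""
    for idx, (c_lo, c_hi) in enumerate(categories):
        if c_lo <= value <= c_hi:
            return idx
    raise ValueError("value outside all categories")

def base_lower_bound(c_range, value):
    """Smallest possible |v - value| over all v in the category range."""
    c_lo, c_hi = c_range
    if value < c_lo:
        return c_lo - value
    if value > c_hi:
        return value - c_hi
    return 0.0

def warping_distance(s, q, base):
    """Time-warping distance by dynamic programming with a pluggable base distance."""
    inf = float("inf")
    d = [[inf] * (len(q) + 1) for _ in range(len(s) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(q) + 1):
            d[i][j] = base(s[i - 1], q[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]
            )
    return d[len(s)][len(q)]

categories = make_categories(20.0, 24.0, 4)
S = [20.2, 21.7, 21.1, 23.6]          # stored sequence (continuous values)
Q = [20.0, 21.0, 23.0]                # query sequence
codes = [categorize(v, categories) for v in S]
ranges = [categories[c] for c in codes]

exact = warping_distance(S, Q, lambda a, b: abs(a - b))
lower = warping_distance(ranges, Q, base_lower_bound)
print(codes, lower, exact)
```

Because every stored value lies inside its category's range, each base term in the lower-bound computation is no larger than the true |a - b|, so the resulting distance never exceeds the exact one. A candidate whose lower bound already exceeds the query tolerance can be discarded without causing a false dismissal.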

To get the categorized representation of element values, we first generate the set of categories and determine their ranges. Then we convert every element value

Similarity search with sparse suffix tree

A suffix tree that stores only a subset of suffixes is called a sparse suffix tree [27]. Since the size of a suffix tree is linear in the number of leaf nodes, a sparse suffix tree is smaller than an original suffix tree. Suffixes inserted into a tree are called stored suffixes, and suffixes not inserted into a tree are called non-stored suffixes. The reduction of the index size by storing only a subset of suffixes is measured by the compaction ratio r (0 ⩽ r < 1) that is defined as follows: r = the

Experimental evaluation

To study the performance and scalability of the proposed similarity search algorithms, we conducted extensive experiments with real and synthetic data sets. This section first describes the evaluation environment and then selects one of the proposed similarity search algorithms after comparing them in terms of space and time efficiency. We then compare its performance and scalability with those of the previous approaches.

Conclusion

This paper presented an indexing method based on a disk-based suffix tree, for fast retrieval of similar subsequences without false dismissals. Because the sampling rates and the lengths of sequences may be different, the proposed method uses the time-warping distance as a similarity measure that allows stretching or compressing of sequences along the time axis. Extensive experiments with real and synthetic data sequences revealed that the proposed approach significantly outperforms the

References (27)

  • R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of International...
  • R. Agrawal, K. Lin, H.S. Sawhney, K. Shim, Fast similarity search in the presence of noise, scaling, and translation in...
  • C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proceedings of...
  • B.-K. Yi, H.V. Jagadish, C. Faloutsos, Efficient retrieval of similar time sequences under time warping, in:...
  • L. Rabiner et al., Fundamentals of Speech Recognition, 1993
  • D.J. Berndt, J. Clifford, Finding patterns in time series: a dynamic programming approach, in: U.M. Fayyad, G....
  • D. Rafiei, A. Mendelzon, Similarity-based queries for time series data, in: Proceedings of the ACM International...
  • G.A. Stephen, String Searching Algorithms, 1994
  • N. Beckmann, H. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and...
  • D.Q. Goldin, P.C. Kanellakis, On similarity queries for time-series data: constraint specification and implementation,...
  • A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD, Boston, MA,...
  • T. Bozkaya, N. Yazdani, M. Ozsoyoglu, Matching and indexing sequences of different lengths, in: Proceedings of the ACM...
  • C. Faloutsos, K. Lin, Fastmap: a fast algorithm for indexing, data-mining and visualization of traditional and...

    Recommended by Dr. Nick Koudas.
