Copyright © 2002 Elsevier Science (USA). All rights reserved.
Overlap matching
Amihood Amir (a, b, 1), Richard Cole (c, 2), Ramesh Hariharan (d, 3), Moshe Lewenstein (e) and Ely Porat (f)
Received 10 May 2001;
Abstract
We propose a new paradigm for string matching, namely structural matching. In structural matching, the text and pattern contents are not important. Rather, some areas in the text and pattern, such as intervals, are singled out. A “match” is a text location where a specified relation between the text and pattern areas is satisfied. In particular we define the structural matching problem of overlap (parity) matching. We seek the text locations where all overlaps of the given pattern and text intervals have even length. We show that this problem can be solved in time O(n log m), where the text length is n and the pattern length is m. As an application of overlap matching, we show how to reduce the string matching with swaps problem to the overlap matching problem. The string matching with swaps problem is the problem of string matching in the presence of local swaps. The best deterministic upper bound known for this problem was O(nm^{1/3} log m log σ) for a general alphabet Σ, where σ=min(m,|Σ|). Our reduction provides a solution to the pattern matching with swaps problem in time O(n log m log σ).
Author Keywords: Design and analysis of algorithms; Combinatorial algorithms on words; Pattern matching; Pattern matching with swaps; Non-standard pattern matching
Mathematical subject codes: 68R15
Article Outline
- 1. Introduction
- 2. Problem definition
- 3. Algorithm outline for overlap matching
- 3.1. Convolutions
- 3.2. Grouping text segments by parity of starting and ending location
- 3.3. Grouping the pattern segments
- 4. Segments with equal parity start
- 5. The odd–even even–odd segments
- 6. The odd–odd even–even segments
- 7. Swap matching
- 8. Reducing swap matching to overlap matching
- References
1. Introduction
The last few decades have prompted the evolution of pattern matching from a combinatorial solution of the exact string matching problem [13 and 17] to an area concerned with approximate matching of various relationships motivated by computational molecular biology, computer vision, and complex searches in digitized and distributed multimedia libraries [12 and 7]. To this end, two new paradigms were needed—“generalized matching” and “approximate matching”.
In generalized matching the input is still a text and pattern but the “matching” relation is defined differently. The output is all locations in the text where the pattern “matches” under the new definition of match. The different applications define the matching relation. Examples are string matching with “don’t cares” [13], parameterized matching [8 and 4], less-than matching [3], and swapped matching [22, 1 and 10]. Lower bound results on generalized matching can be found in [23].
Even under the appropriate matching relation there is still a distinction between exact matching and approximate matching. In the latter case, a distance function is defined on the text. A text location is considered a match if the distance between it and the pattern, under the given distance function, is within the tolerated bounds. Below are some examples that motivate approximate matching. In computational biology one may be interested in finding a “close” mutation; in communications one may want to adjust for transmission noise; in texts it may be desirable to allow common typing errors. In multimedia one may want to adjust for lossy compressions, occlusions, scaling, affine transformations, or dimension loss.
The earliest and best known distance functions are Levenshtein’s edit distance [20] and the Hamming distance. Let n be the text length and m the pattern length. Lowrance and Wagner [21 and 24] proposed an O(nm) dynamic programming algorithm for the extended edit distance problem. In [14, 18 and 19] O(kn) algorithms are given for the edit distance with only k allowed edit operations. Recently, Cole and Hariharan [9] presented an O(nk^4/m + n + m) algorithm for this problem. Amir et al. [6] presented faster algorithms for the Hamming distance case.
Both the above paradigms have an important trait in common—matching is dependent on the alphabet symbols in the respective pattern and text locations. In this paper we propose a new paradigm—structural matching. In this model, the content of the pattern and text is not important. What is important is the structure of these strings. Certain areas in the text and pattern are identified and a “match” of the pattern in the text is a location where these special areas satisfy a required relation. Structural matching is motivated by both an “ends” and a “means” perspective.
In molecular biology, it has long been a practice to consider special areas by their structure. Examples are repetitive genomic structures [15] such as tandem repeats, LINEs (long interspersed nuclear sequences), and SINEs (short interspersed nuclear sequences) [16]. Many problems in biology can be expressed as structural matching problems, thus streamlining and identifying the combinatorial nature of the problem.
The second perspective is functional. The rich repertoire of relations between areas in the text and pattern can offer interesting tools for the solution of hitherto unresolved problems. In this paper we demonstrate such a use of structural matching in providing the fastest known algorithm for swap matching.
The pattern matching with swaps problem (the swap matching problem, for short), defined by Muthukrishnan [22], requires finding all occurrences of a pattern of length m in a text of length n. The pattern is said to match the text at a given location i if adjacent pattern characters can be swapped, if necessary, so as to make the pattern identical to the substring of the text starting at location i. All the swaps are constrained to be disjoint, i.e., each character is involved in at most one swap.
The importance of the swap matching problem lies in recent efforts to understand the complexity of various generalized pattern matching problems. Until recently there were no known upper bounds better than the naive O(nm) algorithm for the swap matching problem.
Amir et al. [1] obtained the first non-trivial results on this problem. They showed that the case when the size of the alphabet set Σ exceeds 2 can be reduced to the case when it is exactly 2 with a time overhead of O(log^2 σ). (The reduction overhead was reduced to O(log σ) in the journal version [2].) They then showed how to solve the problem for alphabet sets of size 2 in time O(nm^{1/3} log m), which was the best deterministic time bound known to date. Amir et al. [5] also gave certain special cases for which O(n polylog(m)) time can be obtained. However, these cases are rather restrictive. In their technical report [10] Cole and Hariharan provide a first step toward the current result by giving a randomized algorithm that solves the swap matching problem over a binary alphabet in time O(n log n). We note that the technical report is now subsumed by this result.
In this paper we define a structural matching problem—the overlap (parity) matching problem—as follows.
Input: Text T of length n with marked intervals (substrings), and pattern P of length m with marked intervals (substrings).
Output: The text locations ℓ for which all overlaps of the marked intervals of T and marked intervals of P have even-length overlap, when the left end of P is aligned with text location ℓ.
We present a deterministic algorithm that solves the overlap matching problem in time O(n log m).
We then reduce the swap matching problem over a binary alphabet to the overlap parity problem. Coupled with the alphabet reduction of Amir et al. [2], this gives an algorithm for swap matching over a general alphabet whose running time is O(n log m log σ).
There are three main contributions in this paper:
The introduction of a new model in pattern matching, that of structural matching.
An efficient solution of the overlap matching problem.
The surprising time complexity of O(n log m) for solving the swap matching problem over binary alphabets. Until recently it was open whether the problem had an o(nm) solution.
Roadmap. In Section 2 we give basic definitions. In Sections 3, 4, 5 and 6 we solve the overlap matching problem. In Section 7 we define the swap matching problem. In Section 8 we prove a key lemma and show how to utilize this lemma in order to reduce swap matching to overlap matching.
2. Problem definition
Consider a linear structure composed of contiguous units, called segments. Each segment has an associated length. A segment can be either marked or unmarked. A structural string is a concatenation of (marked and unmarked) segments. See Fig. 1 for an example.
Overlap matching is defined as follows:
Input: A structural string, P, which we will call the pattern, of length m units (i.e., the sum of the segment lengths), and a structural string, T, which we call the text, of length n ≥ m units.
Output: All text locations k, where, when P is aligned to start at k, each overlapping marked text segment and marked pattern segment have even-length overlap.
Alternatively, we can replace the even-length overlap requirement with an odd-length overlap requirement. Since our solution to the even-length case can be massaged, without much difficulty, to the odd-length case, we will only consider the even-length case.
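As a reference point for the definition above, the following is a minimal brute-force sketch in Python. It is not the algorithm of this paper; it simply checks every alignment in O(nm) time or worse. The encoding of marked segments as 0-based inclusive `(start, end)` pairs and the function name `overlap_matches` are assumptions made for illustration.

```python
def overlap_matches(text_segs, pattern_segs, n, m):
    """Return all alignments k (0-based) at which every overlap between a
    marked text segment and a marked pattern segment has even length.

    Segments are (start, end) pairs, 0-based and inclusive (an assumed
    encoding; the definition only requires marked segments).
    """
    matches = []
    for k in range(n - m + 1):
        ok = True
        for ps, pe in pattern_segs:
            for ts, te in text_segs:
                # Length of the intersection of [ts, te] and [ps + k, pe + k].
                overlap = min(te, pe + k) - max(ts, ps + k) + 1
                if overlap > 0 and overlap % 2 == 1:
                    ok = False
        if ok:
            matches.append(k)
    return matches


text_segs = [(1, 4), (7, 8)]      # marked text segments, n = 10
pattern_segs = [(0, 1), (3, 4)]   # marked pattern segments, m = 5
result = overlap_matches(text_segs, pattern_segs, n=10, m=5)
print(result)
```

A checker of this kind is mainly useful for validating a faster implementation on small inputs.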
Another way to visualize the pattern and the text is as regular strings partitioned into “segments” with several segments marked. Thus we can consider P to be p_1 p_2 ⋯ p_m and T to be t_1 t_2 ⋯ t_n, the usual way of viewing a pattern and a text in string matching. However, it must be noted that the overlap matching problem is a structural problem and not a character dependent problem, whereas the usual problems in string matching are character dependent rather than structural. Therefore, even though we use the standard string matching notation for P and T, the individual characters are irrelevant to the problem.
We note that the overlap matching problem derives from a character-based string matching application, and as a result the unary encoding of lengths in our complexity measure is appropriate. If one is inclined to use the binary representation then there is a potential small gain to solving the problem naively if the number of segments is very small. However, doing better than this for a large collection of segments is non-obvious. Since our solution relies on the length of the structural strings independently of the number of segments we present our results as unary encodings.
3. Algorithm outline for overlap matching
The pattern matches at a given location when each overlap of marked segments is of even-length. Restated, we have the following property:
Overlap parity property: P matches at location i of T exactly if, when P is aligned at location i, no pair of (pattern, text) marked segments has odd-length overlap.
This allows us to consider pairs of marked segments separately. As soon as a pair is found with odd-length overlap, it immediately leads to the conclusion that there is no match in that location.
The main idea of the algorithm is to separate the marked segments of the text and pattern into a small number of groups. In each of these groups it will be possible to check for the overlap parity property in time O(n log m) using polynomial multiplications (convolutions), which can be done in time O(n log m) using FFT in a model with word length O(log m) bits. A formal definition is given in the following section. In the sections that follow we handle the different cases. Some of these cases necessitate new and creative uses of convolutions.
3.1. Convolutions
Convolutions are used for filtering in signal processing and other applications. A convolution uses two initial functions, t and p, to produce a third function t ⊗ p. We formally define a discrete convolution.
Definition 1. Let T be a function whose domain is {0,…,n−1} and P a function whose domain is {0,…,m−1}. We may view T and P as arrays of numbers, whose lengths are n and m, respectively. The discrete convolution of T and P is the polynomial multiplication T ⊗ P, where:
$$(T \otimes P)[j] = \sum_{i=0}^{m-1} T[j+i]\,P[i], \qquad j = 0,\ldots,n-m.$$
In the general case, the convolution can be computed by using the fast Fourier transform (FFT) [11] on T and P^R, the reverse of P. This can be done in time O(n log m), in a computational model with word size O(log m).
Important property: The crucial property contributing to the usefulness of convolutions is the following. For every fixed location j0 in T, we are, in essence, overlaying P on T, starting at j0, i.e., P[0] corresponds to T[j0], P[1] to T[j0+1],…,P[i] to T[j0+i],…,P[m−1] to T[j0+m−1]. We multiply each element of P by its corresponding element of T and add all m resulting products. This is the convolution’s value at location j0.
Clearly, computing the convolution’s value for every text location j can be done in time O(nm). The fortunate property of convolutions over algebraically closed fields is that they can be computed for all n text locations in time O(n log m) using the FFT.
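The relationship between the direct O(nm) evaluation and the FFT-based one can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the names `fft`, `correlate_naive` and `correlate_fft` are hypothetical, the inputs are integer arrays, and the text is transformed in one piece, which costs O(n log n) rather than O(n log m) (the latter is usually obtained by cutting the text into overlapping blocks of length about 2m, which is omitted here).

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def correlate_naive(T, P):
    """Direct evaluation of sum_i T[j+i] * P[i] for every valid j."""
    n, m = len(T), len(P)
    return [sum(T[j + i] * P[i] for i in range(m)) for j in range(n - m + 1)]


def correlate_fft(T, P):
    """Same values, obtained by multiplying T with the reverse of P."""
    n, m = len(T), len(P)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft(list(T) + [0] * (size - n))
    fb = fft(list(reversed(P)) + [0] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # The inverse transform above is unnormalised, so divide by size.
    coeffs = [round((c / size).real) for c in prod]
    # Coefficient (m - 1 + j) of the product equals sum_i T[j+i] * P[i].
    return coeffs[m - 1:n]


T = [1, 0, 2, 3, 1]
P = [1, 2]
print(correlate_naive(T, P), correlate_fft(T, P))
```

Rounding to integers is only safe here because the inputs are small integers; larger values would call for an error analysis or an exact number-theoretic transform.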
In the next few sections we will be using this property of convolutions to efficiently compute relations of patterns in texts. This will be done via linear reductions to convolutions. In the definition below ℕ represents the natural numbers and ℝ represents the real numbers.
Definition 2. Let P be a pattern of length m and T a text of length n over some alphabet Σ. Let R(S1,S2) be a relation on strings of length m over Σ. We say that the relation R holds between P and location j of T if R(P[0]⋯P[m−1], T[j]T[j+1]⋯T[j+m−1]). We say that R is linearly reduced to convolutions if there exist a natural number c, a constant time computable function $f: \mathbb{N}^c \to \{0,1\}$, and linear time functions $\ell^m_1,\ldots,\ell^m_c$ and $r^n_1,\ldots,r^n_c$, $\forall n, m \in \mathbb{N}$, where $\ell^m_i: \Sigma^m \to \mathbb{R}^m$, $r^n_i: \Sigma^n \to \mathbb{R}^n$, $i=1,\ldots,c$, such that R holds between P and location j in T iff
$$f\big(\ell^m_1(P) \otimes r^n_1(T)[j],\ \ell^m_2(P) \otimes r^n_2(T)[j],\ \ldots,\ \ell^m_c(P) \otimes r^n_c(T)[j]\big) = 1.$$
Let R be a relation that is linearly reduced to convolutions. It follows immediately from the definition that, using the FFT to compute the c convolutions, it is possible to find all locations j in T where relation R holds in time O(n log m).
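Example. Let Σ={a,b} and R the equality relation. The locations where R holds between P and T are the locations j, where T[j+i]=P[i], i=0,…,m−1. Fischer and Paterson [13] showed that it can be computed in time O(n log m) by the following trivial reduction to convolutions. Let $\ell_1=\chi_a$, $\ell_2=\chi_b$, $r_1=\chi_{\bar a}$, $r_2=\chi_{\bar b}$, where $\chi_\sigma(x)=1$ if $x=\sigma$ and 0 otherwise, $\chi_{\bar\sigma}(x)=1$ if $x\neq\sigma$ and 0 otherwise, and where, for a string $S=s_1 s_2\cdots s_n$, $\chi_\sigma(S)=\chi_\sigma(s_1)\chi_\sigma(s_2)\cdots\chi_\sigma(s_n)$. Let $f(x,y)=0$ if $x=y=0$ and 1 otherwise. Then $f(\ell_1(P)\otimes r_1(T)[j],\ \ell_2(P)\otimes r_2(T)[j])=0$ iff there is an exact matching of P at location j of T.
3.2. Grouping text segments by parity of starting and ending location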
It will be important for us to know whether the marked segment we are dealing with starts at an odd or even text location. We also would like to know whether it ends at an odd or even text location. Consequently, we define new texts where each text has exactly those marked segments of a given start and end parity, with all other text elements defined as φ (don’t care) which consequently never contribute an error. (In the polynomial multiplication there will always be a 0 in these text locations.) Henceforth, for short, we will talk of multiplying a text and pattern, by which we mean a polynomial multiplication, where each of the text and pattern is viewed as a vector of coefficients for the corresponding polynomials.
Definition. T_oo is a string of length n where for every location i, if t_i is in a marked segment whose first element is in an odd location and whose last element is in an odd location, T_oo[i]=1. In all other locations j, T_oo[j]=φ. T_oe, T_eo, T_ee are defined in a similar fashion.
Example.
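A minimal Python sketch of this grouping is given here for illustration. It assumes marked segments are supplied as 1-based inclusive `(start, end)` pairs and uses `None` to stand in for the don't-care symbol φ; the function name `group_by_parity` is hypothetical.

```python
def group_by_parity(segments, n):
    """Split marked segments into the four strings T_oo, T_oe, T_eo, T_ee.

    segments: (start, end) pairs, 1-based and inclusive (assumed encoding).
    Returns a dict mapping 'oo', 'oe', 'eo', 'ee' to lists of length n in
    which marked locations hold 1 and all others hold None (don't care).
    """
    groups = {key: [None] * n for key in ("oo", "oe", "eo", "ee")}
    for start, end in segments:
        key = ("o" if start % 2 else "e") + ("o" if end % 2 else "e")
        for loc in range(start, end + 1):
            groups[key][loc - 1] = 1
    return groups


groups = group_by_parity([(1, 3), (5, 6), (8, 8)], n=9)
for key, row in groups.items():
    print(key, row)
```

In this sample, the segment at locations 1 to 3 lands in the odd-odd group, the one at 5 to 6 in the odd-even group, and the single-location segment at 8 in the even-even group.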
3.3. Grouping the pattern segments
The pattern segments are defined in exactly the same way as the text segments and we would like to group them in the same way. However, there is a difficulty here in “nailing down” the parity of a location, since the pattern is shifted and compared to every text location, i.e., the grouping of pattern segments needs to be related to the text parity in order to measure the parity of overlap. However, since the only property we have used in our grouping of text segments was the parity of the text location, it is clear that in all the pattern alignments the pattern segments that start in odd text locations are either all at odd pattern locations or all at even pattern locations. Similarly, these parities will be the same for all pattern alignments that start in even text locations.
Thus we limit ourselves to two cases, the odd result case and the even result case. The odd result case comprises the pattern alignments starting in odd text locations. The even result case comprises the pattern alignments starting in even text locations.
When we fix an alignment result case it is possible to define the parity of the starting and ending locations of each pattern segment. Thus we can get, for the odd result case, patterns P^O_oo, P^O_oe, P^O_eo, and P^O_ee. Similarly, for the even result case we consider patterns P^E_oo, P^E_oe, P^E_eo, and P^E_ee. We are now ready for the algorithm.
Algorithm.
for X = O, E do
  for ti = o, e do
    for tj = o, e do
      for pi = o, e do
        for pj = o, e do
          T_{ti,tj} ⊗ P^X_{pi,pj} {Note that when X=O we only consider the results in the odd text locations and when X=E we only consider the results in the even text locations.}
The cases we describe are the following. First fix the result parity to the odd result case. It is clear that the even result case is symmetric. Since we do not need the X parameter in the algorithm, we will henceforth ignore it for the sake of a simpler notation. We are now down to the combinations T_{ti,tj} and P_{pi,pj}. This gives us 16 cases. We will handle separately only the following three types of cases:
- T_{ti,tj} and P_{pi,pj} where either ti=pi or tj=pj. (This type covers 12 cases.) These situations are handled in Section 4.
- T_{ti,tj} and P_{pi,pj} where ti,tj=oe and pi,pj=eo; or where ti,tj=eo and pi,pj=oe. These cases are handled in Section 5.
- T_{ti,tj} and P_{pi,pj} where ti,tj=oo and pi,pj=ee; or where ti,tj=ee and pi,pj=oo. These cases are handled in Section 6.
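The case count can be checked mechanically. The short Python sketch below enumerates the 16 parity combinations and assigns each to one of the three types listed above; the function name `classify` and the label strings are illustrative only.

```python
from itertools import product


def classify(ti, tj, pi, pj):
    """Assign a (text, pattern) parity combination to one of three types."""
    if ti == pi or tj == pj:
        return "equal start or end parity (Section 4)"
    # From here on both parities differ, so the two groups are complementary.
    if {ti + tj, pi + pj} == {"oe", "eo"}:
        return "odd-even / even-odd (Section 5)"
    return "odd-odd / even-even (Section 6)"


counts = {}
for ti, tj, pi, pj in product("oe", repeat=4):
    label = classify(ti, tj, pi, pj)
    counts[label] = counts.get(label, 0) + 1

for label, count in counts.items():
    print(count, label)
```

The enumeration reports 12 combinations of the first type and 2 of each of the other two, in agreement with the count stated in the list.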
4. Segments with equal parity start
Consider the case T_{ti,tj} and P_{pi,pj} where ti=pi.
Observation 1. For every two segments, S_t in T_{ti,tj}, starting at location x, and S_p in P_{pi,pj}, starting at location y, |x−y| is always even.
We are interested in the length of the segment overlaps. We now show a convolution for which the resulting value at location i is 0 exactly if there is no odd-length segment overlap with the pattern starting at location i.
The convolution: Pattern P′ = p′_1 ⋯ p′_m is constructed as follows: p′_i = 0 if P_{pi,pj}[i] = φ, and p′_i = 1 otherwise.
Text T′ = t′_1 ⋯ t′_n is constructed by replacing every φ in T_{ti,tj} by 0, and every segment in T_{ti,tj} by a segment of alternating 1’s and −1’s, starting with 1. Then P′ and T′ are convolved. We say that a text segment of T (likewise a pattern segment of P) is relevant in relation to T′ (likewise P′) if in T′ it is marked with non-zeroes.
Lemma 1. Let (T′ ⊗ P′)[q] be the qth element in the result of the convolution. (T′ ⊗ P′)[q] is equal to the number of pairs of the relevant text segments and relevant pattern segments which overlap with odd overlap when placing the pattern at location q of the text.
Proof. This follows from the definitions of convolutions, of T′, and of P′ and from the observation that for all cases where the starting location of a pattern segment is smaller than the starting location of a text segment and the pair overlaps the contribution to the result of the convolution will be 1 if the length of the overlap is odd and 0 if it is even (since every text segment starts with a 1 and then alternates between −1 and 1). Because of Observation 1, even when the text segment starts at a smaller location than the pattern segment, the difference between the starting locations has even length. Therefore in the area of the overlap, the text starts with a 1 and alternates between −1 and 1. Thus the convolution gives us the desired result. □
Locations where (T′ ⊗ P′)[q]=0 are locations without odd-overlap between relevant text and pattern segments.
This solves all eight cases of T_{ti,tj} and P_{pi,pj} where ti=pi. For the additional four cases where tj=pj we simply reverse the text and pattern to obtain the case considered above.
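The construction of this section can be illustrated with a small Python sketch that builds the alternating ±1 text and the 0/1 pattern, evaluates the product directly (no FFT, for brevity), and compares it with a brute-force count of odd-length overlaps. The setup is an assumption for illustration: indices are 0-based, every segment in both groups starts at an even index, and only even alignments q are inspected, which is the 0-based counterpart of fixing one result parity. All function names are hypothetical.

```python
def odd_overlap_counts(text_segs, pattern_segs, n, m):
    """Evaluate the Section 4 product directly at every alignment q."""
    t_prime = [0] * n
    for s, e in text_segs:
        for d in range(e - s + 1):
            t_prime[s + d] = 1 if d % 2 == 0 else -1   # 1, -1, 1, ...
    p_prime = [0] * m
    for s, e in pattern_segs:
        for i in range(s, e + 1):
            p_prime[i] = 1
    return [sum(t_prime[q + i] * p_prime[i] for i in range(m))
            for q in range(n - m + 1)]


def brute_force_counts(text_segs, pattern_segs, n, m):
    """Count (text, pattern) segment pairs with odd overlap at every q."""
    counts = []
    for q in range(n - m + 1):
        c = 0
        for ps, pe in pattern_segs:
            for ts, te in text_segs:
                k = min(te, pe + q) - max(ts, ps + q) + 1
                if k > 0 and k % 2 == 1:
                    c += 1
        counts.append(c)
    return counts


# 0-based inclusive segments; all of them start at an even index.
text_segs = [(0, 2), (4, 7), (10, 10)]
pattern_segs = [(0, 1), (2, 4)]
conv = odd_overlap_counts(text_segs, pattern_segs, n=12, m=6)
brute = brute_force_counts(text_segs, pattern_segs, n=12, m=6)
print(conv[0::2], brute[0::2])   # compare even alignments only
```

At the even alignments of this sample the two sequences agree, and the single alignment with value 0 (q = 4) is the only one without an odd overlap inside this group.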
5. The odd–even even–odd segments
Consider the case T_oe and P_eo (the case of T_eo and P_oe is symmetric).
Terminology: Let S_t be a text segment whose starting location is s_t and whose ending location is f_t. Let S_p be a pattern segment being compared to the text at starting position s_p and ending position f_p. If s_t<s_p<f_p<f_t then we say that S_t contains S_p. If s_p<s_t<f_t<f_p then we say that S_p contains S_t. If s_t<s_p<f_t<f_p then we say that S_t has a left overlap with S_p. If s_p<s_t<f_p<f_t then we say that S_t has a right overlap with S_p. We will sometimes refer to a left or right overlap as a side overlap.
Observation 2. For every two segments, S_t in T_oe and S_p in P_eo, if either S_p is contained in S_t or S_t is contained in S_p then the overlap is of even length. If the overlap is a left overlap or right overlap then it is of odd length. All possible cases are shown in Fig. 2 below.
Fig. 2. The cases where the text segment starts at an odd location and ends at an even location, and the pattern segment does the opposite.
The correctness of the observation is immediate. Segments of these types have even length. Thus, if one contains the other the overlap is necessarily of even length. Conversely, in case of a left or right overlap, which we call side overlaps, the overlap starting and ending locations have the same parity, making the length of the overlap odd. Remember our desire is to detect all locations where there are segments which have odd overlap. As before, we will use convolutions. The desired property for such a convolution is as follows.
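The parity claim can also be confirmed by exhaustive enumeration over small locations. The Python sketch below is illustrative only; it uses 1-based locations, the strict inequalities of the Terminology paragraph, and hypothetical function names.

```python
def classify_overlap(st, ft, sp, fp):
    """Name the relation between text segment [st, ft] and pattern segment
    [sp, fp], following the strict inequalities of the Terminology."""
    if st < sp < fp < ft:
        return "text contains pattern"
    if sp < st < ft < fp:
        return "pattern contains text"
    if st < sp < ft < fp:
        return "left overlap"
    if sp < st < fp < ft:
        return "right overlap"
    return None  # disjoint, or a configuration outside the four named cases


def check_observation(limit=12):
    """Collect (relation, overlap length mod 2) over all small segments."""
    seen = set()
    for st in range(1, limit, 2):                 # text: odd start
        for ft in range(st + 1, limit + 1, 2):    # text: even end
            for sp in range(2, limit, 2):         # pattern: even start
                for fp in range(sp + 1, limit + 1, 2):   # pattern: odd end
                    kind = classify_overlap(st, ft, sp, fp)
                    if kind is None:
                        continue
                    k = min(ft, fp) - max(st, sp) + 1
                    seen.add((kind, k % 2))
    return seen


seen = check_observation()
print(sorted(seen))
```

Every containment found has even overlap length and every side overlap has odd length, as the observation states.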
Containment elimination property: Let R[j] be the result of the convolution at location j. The contribution of a side overlap to R[j], i.e., the value of the convolution restricted to that overlap, is positive, and the contribution of contained and containing overlaps is 0.
To achieve this property we will actually use three convolutions instead of one and use their sum to achieve the property. The three convolutions, the overlap length convolution, the zero containment convolution, and the shifting convolution, are defined in this section.
The “desired” convolution(s) is one where segments that are contained in each other contribute a 0, and side overlaps contribute positive numbers. This convolution will be more complex than the previous one. The reason is that there is an inherent relation between text segments that start within a pattern segment. Likewise, there is a commonality between text segments that end within a pattern segment. The difficulty is that we need to differentiate between cases that are naturally more easily handled together.
To solve this we give an overview of a “wishful” convolution that gives the flavor of the solution. Consider the following case. Let S_t be a segment of T_oe starting at text location s_t and having length ℓ_t. We assume that the pattern P_eo is aligned so that the first symbol of a segment S_p, of length ℓ_p, occurs at text location s_p. We further assume that one of the segments is contained in the other. If we replace S_t by s_t 1^{ℓ_t−2} (−s_t), that is, the value s_t followed by ℓ_t−2 ones and then the value −s_t, and S_p by s_p 1^{ℓ_p−2} (−s_p) (and φ by 0), then multiplication of these two segments will yield k−2, where k is the size of the overlap (or more precisely the term in the result corresponding to the above alignment is k−2). See Fig. 3.
Conversely, if there is a side overlap of S_t with S_p then the multiplication will yield max(s_t,s_p)−min(s_t,s_p)+k−2, where k is the size of the overlap. See Fig. 4.
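The two values quoted above can be reproduced on small instances. The following Python sketch is an illustration under assumptions: locations are 1-based, both segments are encoded using their text locations (the "wishful" setting), and only overlaps satisfying the strict inequalities of the Terminology paragraph are tried. The function names are hypothetical.

```python
def encode(start, length):
    """Encode a segment as  start, 1, ..., 1, -start  (length >= 2)."""
    return [start] + [1] * (length - 2) + [-start]


def aligned_product(st, ft, sp, fp):
    """Sum of location-wise products of the two encoded segments over the
    text locations they share (1-based, inclusive endpoints)."""
    text = dict(zip(range(st, ft + 1), encode(st, ft - st + 1)))
    patt = dict(zip(range(sp, fp + 1), encode(sp, fp - sp + 1)))
    return sum(text[x] * patt[x] for x in text if x in patt)


# Containments: the result is k - 2, with k the overlap length.
print(aligned_product(1, 8, 4, 7))   # pattern inside text, k = 4
print(aligned_product(3, 6, 2, 9))   # text inside pattern, k = 4

# Side overlaps: the result is max(st, sp) - min(st, sp) + k - 2.
print(aligned_product(1, 6, 4, 9))   # left overlap,  k = 3
print(aligned_product(3, 8, 2, 5))   # right overlap, k = 3
```

In each of these four instances the printed value equals the corresponding expression from the text.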
• Problem 1: The described convolution is indeed a “wishful” convolution since we desire to construct convolutions for the pattern based on text locations. In fact in all n−m locations the convolution codes (reductions) need to be different.
• Problem 2: Assuming we manage to solve problem 1 we still need to remove the term “size of the overlap −2” from the result to achieve the containment elimination property.
However, if the above two problems could be solved by a convolution, then the result would be 0 exactly if all overlaps are containments, which by Observation 2 holds exactly if all overlaps have even length. We solve both problems by using novel convolutions.
5.1. The convolutions for the odd–even even–odd segments case
Solution to problem 2. The convolution below provides, for every text location i, the sum Σ_j (k_j − 2), where the k_j are all the overlap lengths. All that is necessary, then, is to subtract the result of this convolution from the value obtained for each text location.
The overlap length convolution: Every segment of length ℓ in both text and pattern is replaced by 0 1^{ℓ−2} 0. All φ’s are replaced by 0’s. The modified text and pattern are multiplied.
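A small Python sketch of the overlap length convolution follows, evaluated directly at a single alignment. It is illustrative only: segments are 1-based inclusive pairs, the chosen alignment contains one side overlap and one containment, and every overlap has length at least 2 as in the strict inequalities of the Terminology paragraph. The function names are hypothetical.

```python
def interior_ones(length, segments):
    """Replace each segment (1-based, inclusive) by 0 1^(l-2) 0; rest is 0."""
    out = [0] * length
    for start, end in segments:
        for loc in range(start + 1, end):
            out[loc - 1] = 1
    return out


def product_at(text_vec, pattern_vec, shift):
    """Value of the product when the pattern starts at text index shift."""
    return sum(text_vec[shift + i] * p for i, p in enumerate(pattern_vec))


def sum_overlap_minus_two(text_segs, pattern_segs, shift):
    """Brute-force sum of (k - 2) over all overlapping segment pairs."""
    total = 0
    for ps, pe in pattern_segs:
        for ts, te in text_segs:
            k = min(te, pe + shift) - max(ts, ps + shift) + 1
            if k > 0:
                total += k - 2
    return total


text_segs = [(1, 4), (7, 12)]      # odd start, even end; n = 14
pattern_segs = [(2, 5), (8, 11)]   # even start, odd end; m = 12
by_product = product_at(interior_ones(14, text_segs),
                        interior_ones(12, pattern_segs), shift=0)
by_definition = sum_overlap_minus_two(text_segs, pattern_segs, shift=0)
print(by_product, by_definition)
```

Here the left overlap of length 3 contributes 1 and the containment of length 4 contributes 2, so both computations give 3.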
Solution to problem 1. In order to actually implement the “wishful” convolution we use two different convolutions and subtract one from the other.
Note that the starting locations of all text segments are known in advance and thus the substitution of the segment by s_t 1^{ℓ_t−2} (−s_t) can be done in linear time for the entire text. The pattern segment is the problematic one since it has n−m different alignments. Therefore, we treat the pattern segments as if they all start relative to location 1 rather than location i of the text. In other words, we use the pattern locations instead of the text locations, but observe that the relative difference between them is the same whether the locations are from the text or from the pattern. To make up for this we will need to add the missing differences with a different convolution.
The zero containment convolution: Every segment of length ℓ_t starting in location s_t of the text is replaced by s_t 1^{ℓ_t−2} (−s_t). Every segment of length ℓ_p starting at location s_pp of the pattern is replaced by s_pp 1^{ℓ_p−2} (−s_pp). All φ’s are replaced by 0’s. The modified text and pattern are multiplied.
The problem is that now the result of the multiplication of side overlaps (after subtraction of the overlap length convolution) may sometimes be positive and sometimes be negative. This may cause the zero containment convolution result to be a 0 even for overlaps that are not contained, i.e., of odd length. We will then not be able to distinguish between them and the cases of all overlaps being containments.
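The sign behaviour described here can be observed on small instances. The Python sketch below is illustrative and rests on assumptions: segments are 1-based inclusive pairs, text segments start odd and end even, pattern segments start even and end odd, and only alignments whose overlaps all have length at least 2 are evaluated. It computes the zero containment product minus the overlap length product; the function names are hypothetical.

```python
def encode_ends(length, segments):
    """Replace each segment by  s, 1, ..., 1, -s  with s its own start."""
    out = [0] * length
    for start, end in segments:
        for loc in range(start, end + 1):
            out[loc - 1] = 1
        out[start - 1] = start
        out[end - 1] = -start
    return out


def encode_interior(length, segments):
    """Replace each segment by 0 1^(l-2) 0."""
    out = [0] * length
    for start, end in segments:
        for loc in range(start + 1, end):
            out[loc - 1] = 1
    return out


def product_at(text_vec, pattern_vec, shift):
    return sum(text_vec[shift + i] * p for i, p in enumerate(pattern_vec))


def adjusted(text_segs, pattern_segs, n, m, shift):
    """Zero containment product minus overlap length product."""
    zero_containment = product_at(encode_ends(n, text_segs),
                                  encode_ends(m, pattern_segs), shift)
    overlap_length = product_at(encode_interior(n, text_segs),
                                encode_interior(m, pattern_segs), shift)
    return zero_containment - overlap_length


# Containment only: the adjusted value is 0.
only_contained = adjusted([(1, 8)], [(2, 5)], n=10, m=6, shift=0)
# Containment plus a right overlap: positive (text start 9, pattern start 8).
positive = adjusted([(1, 6), (9, 14)], [(2, 5), (8, 11)], n=16, m=12, shift=0)
# A left overlap whose pattern-relative start (2) is below the text start (5).
negative = adjusted([(5, 8)], [(2, 5)], n=10, m=6, shift=4)
print(only_contained, positive, negative)
```

Because one side overlap can contribute a positive amount and another a negative one, a sum of such terms may cancel to 0, which is the ambiguity the adjustment in the following text is meant to remove.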
We now show how to adjust the zero containment convolution to give the desired result. The zero containment convolution (after subtracting the result of the overlap length convolution) causes all contained segments to give the result 0. Thus in location i_0 we get the value
$$\sum_{\ell \in A} s_{p_\ell} - \sum_{k \in B} s_{p_k} + \sum_{i \in C} s_i - \sum_{j \in D} s_j,$$
The problem is that the result we really want is
$$\sum_{\ell \in A} s_\ell - \sum_{k \in B} s_k + \sum_{i} \ldots$$









