Skip to main content
Log in

A sublinear algorithm for approximate keyword searching

  • Published:
Algorithmica Aims and scope Submit manuscript

    We’re sorry, something doesn't seem to be working properly.

    Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the “database”A is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N c for somec<1. The sequenceA must be overa. finite alphabet σ. More precisely, our algorithm requires 0(DNpow(ɛ) logN) expected-time where ɛ=D/P is the maximum number of differences as a percentage of query length, and pow(ɛ) is an increasing and concave function that is 0 when ɛ=0. Thus the algorithm is superior to current O(DN) algorithms when ɛ is small enough to guarantee that pow(ɛ) < 1. As seen in the paper, this is true for a wide range of ɛ, e.g., ɛ. up to 33% for DNA sequences (¦⌆¦=4) and 56% for proteins sequences (¦⌆¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman, A basic local alignment search tool,J. Molecular Biol. 215 (1990), 403–410.

    Google Scholar 

  2. Boyer, R., and J. Moore, A fast string searching algorithm,Comm. ACM 20(10) (1977), 262–272.

    Article  Google Scholar 

  3. Chang, W. I., and E. L. Lawler, Approximate matching in sublinear expected time,Proc. 31st IEEE Symp. on Foundation of Computer Science, 1990, pp. 116–124.

  4. Galil, Z., and K. Park, An improved algorithm for approximate string matching,SIAM J. Comput. 19(6) (1990), 989–999.

    Article  MATH  MathSciNet  Google Scholar 

  5. Knuth, D. E., J. H. Morris, and V. R. Pratt, Fast pattern matching in strings,SIAM J. Comput. 6(2) (1977), 323–350.

    Article  MATH  MathSciNet  Google Scholar 

  6. Landau, G. M., and U. Vishkin, Introducing efficient parallelism into approximate string matching and a new serial algorithm,Proc. Symp. on Theory of Computing, 1986, pp. 220–230.

  7. Myers, E. W., Incremental alignment algorithms and their applications, Technical Report 86-22, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1986.

    Google Scholar 

  8. Myers, E. W., AnO(ND) difference algorithm and its variants,Algorithmica 1 (1986), 251–266.

    Article  MATH  MathSciNet  Google Scholar 

  9. Myers, E. W., and D. Mount, Computer program for the IBM personal computer that searches for approximate matches to short oligonucleotide sequences in long target DNA sequences,Nucleic Acids Res. 14(1) (1986), 1025–1041.

    Article  Google Scholar 

  10. Sellers, P. H., The theory and computation of evolutionary distances: Pattern recognition,J. Algorithms 1 (1980), 359–373.

    Article  MATH  MathSciNet  Google Scholar 

  11. Ukkonen, E., Finding approximate patterns in strings,J. Algorithms 6 (1985), 132–137.

    Article  MATH  MathSciNet  Google Scholar 

  12. Ukkonen, E., Algorithms for approximate string matching,Inform, and Control 64 (1985), 100–118.

    Article  MATH  MathSciNet  Google Scholar 

  13. Wu, S., U. Manber, and E. W. Myers, A Subquadratic Algorithm for Approximate Limited Expression Matching, Technical Report TR92-36, Department of Computer Science, University of Arizona, Tucson, AZ 85721, 1992 (submitted toAlgorithmica).

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Additional information

Communicated by Alberto Apostolico

This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Myers, E.W. A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994). https://doi.org/10.1007/BF01185432

Download citation

  • Received:

  • Revised:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF01185432

Key words

Navigation