
Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores

Bulletin of Mathematical Biology

Abstract

A method is described for estimating the distribution, and hence testing the statistical significance, of sequence similarity scores obtained during a data-bank search. Maximum likelihood is used to fit a model to the scores, avoiding any costly simulation of random sequences. The method is applied in detail to the Smith-Waterman algorithm when gaps are allowed, and is shown to give results very similar to those obtained by simulation.
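The idea of fitting a distribution to observed scores by maximum likelihood, rather than by simulating random sequences, can be illustrated with a small sketch. The abstract does not state the form of the fitted model, so the example below assumes, purely for illustration, that the scores follow a Gumbel (type I extreme-value) distribution with location `mu` and scale `beta`. The function names `fit_gumbel_mle`, `gumbel_pvalue` and `sample_gumbel` are hypothetical and are not taken from the paper. The model used in the article itself may well be richer, for example by depending on sequence length or composition.

```python
import math
import random


def fit_gumbel_mle(scores, tol=1e-10, max_iter=200):
    """Maximum-likelihood estimates (mu, beta) for a Gumbel model of `scores`.

    Assumption: the scores are i.i.d. Gumbel. The scale beta solves
        beta = mean(x) - sum(x * w) / sum(w),  w = exp(-x / beta),
    which is found here by bisection; mu then follows in closed form.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / n
    if var == 0.0:
        raise ValueError("scores must not all be identical")
    shift = min(scores)  # keeps every exponent <= 0, so exp() cannot overflow

    def weights(beta):
        return [math.exp(-(x - shift) / beta) for x in scores]

    def g(beta):
        w = weights(beta)
        return beta - mean + sum(x * wi for x, wi in zip(scores, w)) / sum(w)

    # Bracket the root: g < 0 for small beta and g > 0 for large beta.
    hi = math.sqrt(var)
    while g(hi) <= 0.0:
        hi *= 2.0
    lo = hi
    while g(lo) >= 0.0:
        lo /= 2.0

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol * hi:
            break

    beta = 0.5 * (lo + hi)
    mu = shift - beta * math.log(sum(weights(beta)) / n)
    return mu, beta


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) under the fitted Gumbel model."""
    return -math.expm1(-math.exp(-(score - mu) / beta))


def sample_gumbel(mu, beta):
    """Draw one Gumbel variate by inverse-transform sampling (for testing)."""
    u = random.random()
    while u == 0.0:  # avoid log(0)
        u = random.random()
    return mu - beta * math.log(-math.log(u))


if __name__ == "__main__":
    random.seed(0)
    data = [sample_gumbel(30.0, 5.0) for _ in range(5000)]
    mu_hat, beta_hat = fit_gumbel_mle(data)
    print(mu_hat, beta_hat, gumbel_pvalue(60.0, mu_hat, beta_hat))
```

In a data-bank search setting, `scores` would be the similarity scores returned by the search itself, and `gumbel_pvalue` would then give an approximate significance level for any individual score under the assumed model. Bisection is used instead of Newton's method only because it needs no derivative and cannot diverge.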




Mott, R. Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology 54, 59–75 (1992). https://doi.org/10.1007/BF02458620
