Abstract
A method is described for estimating the distribution, and hence testing the statistical significance, of sequence similarity scores obtained during a data-bank search. Maximum likelihood is used to fit a model to the scores, avoiding any costly simulation of random sequences. The method is applied in detail to the Smith-Waterman algorithm when gaps are allowed, and is shown to give results very similar to those obtained by simulation.
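The idea of fitting a model to observed scores by maximum likelihood, and then reading significance off the fitted distribution, can be sketched as follows. This is only an illustration under a stated assumption: the abstract does not give the form of the model, so the sketch assumes a plain two-parameter Gumbel (type I extreme-value) distribution with location `mu` and scale `beta`. The function names `fit_gumbel_mle` and `gumbel_p_value` are hypothetical, and the model fitted in the article may well differ.

```python
import math


def fit_gumbel_mle(scores, tol=1e-10):
    """Return (mu, beta) maximising the Gumbel log-likelihood of `scores`.

    ASSUMPTION: scores are treated as i.i.d. Gumbel draws; this is an
    illustrative model, not necessarily the one used in the article.
    """
    n = len(scores)
    mean = sum(scores) / n
    x_min, x_max = min(scores), max(scores)
    if x_min == x_max:
        raise ValueError("scores must not all be equal")

    def weighted_mean(beta):
        # Shift by the minimum so every exponent is <= 0 (no overflow).
        w = [math.exp(-(x - x_min) / beta) for x in scores]
        return sum(x * wi for x, wi in zip(scores, w)) / sum(w)

    def g(beta):
        # The likelihood equation for the scale is g(beta) = 0.
        return beta - mean + weighted_mean(beta)

    # g is negative for tiny beta and positive at beta = range of the data,
    # so bisection between these bounds finds a root.
    lo, hi = 1e-6 * (x_max - x_min), x_max - x_min
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)

    # Given beta, the location has a closed form (computed in shifted form).
    avg = sum(math.exp(-(x - x_min) / beta) for x in scores) / n
    mu = x_min - beta * math.log(avg)
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(S >= score) under the fitted Gumbel model."""
    return -math.expm1(-math.exp(-(score - mu) / beta))
```

In use, `fit_gumbel_mle` would be called once on the scores from a search, and `gumbel_p_value` would then assign each score an upper-tail probability without generating any random sequences.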
Cite this article
Mott, R. Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bltn Mathcal Biology 54, 59–75 (1992). https://doi.org/10.1007/BF02458620