ABSTRACT
The problem of characterizing and detecting recurrent sequence patterns, such as substrings or motifs, and related associations or rules is pursued in many settings in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, and so on. In molecular biology, exceptionally frequent or rare words in biosequences have been implicated in various facets of biological function and structure. The discovery of such patterns, particularly on a massive scale, poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous work, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
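As a rough illustration of what a score measuring departure from expectation might look like, the sketch below counts every substring up to a given length and compares each count with its expectation under an i.i.d. model whose symbol probabilities are estimated from the sequence itself. The function names, the particular score (f - E) / sqrt(E), and the brute-force enumeration are all assumptions made for this example; it is not the linear-time construction the paper describes.

```python
from collections import Counter
from math import prod, sqrt


def symbol_probabilities(text):
    """Estimate i.i.d. symbol probabilities from the text itself (an assumption)."""
    counts = Counter(text)
    n = len(text)
    return {symbol: k / n for symbol, k in counts.items()}


def substring_counts(text, max_len):
    """Count (possibly overlapping) occurrences of every substring of length 1..max_len."""
    counts = Counter()
    n = len(text)
    for m in range(1, max_len + 1):
        for i in range(n - m + 1):
            counts[text[i:i + m]] += 1
    return counts


def expected_count(word, n, probs):
    """Expected occurrences of `word` in an i.i.d. text of length n:
    (n - |word| + 1) times the product of its symbol probabilities."""
    return (n - len(word) + 1) * prod(probs[c] for c in word)


def surprise_scores(text, max_len):
    """Illustrative score (f - E) / sqrt(E) for each substring that occurs in `text`."""
    probs = symbol_probabilities(text)
    n = len(text)
    scores = {}
    for word, f in substring_counts(text, max_len).items():
        e = expected_count(word, n, probs)
        scores[word] = (f - e) / sqrt(e)
    return scores


if __name__ == "__main__":
    scores = surprise_scores("abcabcabc", 3)
    for word in sorted(scores, key=scores.get, reverse=True)[:5]:
        print(word, round(scores[word], 3))
```

With this particular score, extending a word without changing its count can only lower its expectation and therefore cannot lower its score. For example, in "abcabcabc" the words "ab" and "abc" both occur three times, and "abc" scores higher. This is one simple instance of the kind of local monotonicity the abstract refers to. The brute-force enumeration here takes time proportional to the text length times max_len and is meant only to make the quantities concrete.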
Index Terms
- Monotony of surprise and large-scale quest for unusual words