Article

Monotony of surprise and large-scale quest for unusual words

Authors:
Alberto Apostolico

Universitá di Padova, Padova, Italy and Purdue University, West Lafayette, IN

Universitá di Padova, Padova, Italy and Purdue University, West Lafayette, IN
View Profile

,
Mary Ellen Bock

Purdue University, West Lafayette, IN

Purdue University, West Lafayette, IN
View Profile

,
Stefano Lonardi

University of California, Riverside, CA

University of California, Riverside, CA
View Profile

RECOMB '02: Proceedings of the sixth annual international conference on Computational biologyApril 2002Pages 22–31https://doi.org/10.1145/565196.565200

Published:18 April 2002Publication History

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology

Pages 22–31

ABSTRACT

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In Molecular Biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.

References

Apostolico, A. Of maps bigger than the empire, Keynote, Proceedings of the 8th International Colloquium on String Processing and Information Retrieval, Laguna de San Rafael, Chile, November 2001, IEEE Computer Society Press, 2--10 (2001).Google Scholar
Apostolico, A., Bock, M. E., Lonardi, S., and Xu, X. Efficient detection of unusual words. J. Comput. Bio. 7, 1/2 (Jan. 2000), 71--94.Google ScholarCross Ref
Apostolico, A., Bock, M. E., and Xu, X. Annotated statistical indices for sequence analysis. Keynote, Proceedings of Complexity and Compression of SEQUENCES97, Positano, Italy, June 1997, IEEE Computer Society Press, 215--229 (1998). Google ScholarDigital Library
Apostolico, A., and Galil, Z., Eds. Pattern matching algorithms. Oxford University Press, New York, NY, 1997. Google ScholarDigital Library
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., and McConnel, R. Complete inverted files for efficient text retrieval and analysis. J. Assoc. Comput. Mach. 34, 3 (1987), 578--595. Google ScholarDigital Library
Borges, J. L. A Universal History of Infamy. Penguin Books, London, 1975.Google Scholar
Clift, B., Haussler, D., McConnell, R., Schneider, T. D., and Stormo, G. D. Sequence landscapes. Nucleic Acids Res. 14 (1986), 141--158.Google ScholarCross Ref
Gentleman, J. The distribution of the frequency of subsequences in alphabetic sequences, as exemplified by deoxyribonucleic acid. Appl. Statist. 43 (1994), 404--414.Google ScholarCross Ref
Kleffe, J., and Borodovsky, M. First and second moment of counts of words in random texts generated by Markov chains. Comput. Appl. Biosci. 8 (1992), 433--441.Google Scholar
Leung, M. Y., Marsh, G. M., and Speed, T. P. Over and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Bio. 3 (1996), 345--360.Google ScholarCross Ref
Lonardi, S. Global Detectors of Unusual Words: Design, Implementation, and Applications to Pattern Discovery in Biosequences. PhD thesis, Purdue University, 2001. Google ScholarDigital Library
Lundstrom, R. Stochastic models and statistical methods for DNA sequence data. PhD thesis, University of Utah, 1990.Google Scholar
Pevzner, P. A., Borodovsky, M. Y., and Mironov, A. A. Linguistics of nucleotides sequences I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dynamics 6 (1989), 1013--1026.Google ScholarCross Ref
Régnier, M., and Szpankowski, W. On pattern frequency occurrences in a Markovian sequence. Algorithmica 22 (1998), 631--649.Google ScholarCross Ref
Reinert, G., Schbath, S., and Waterman, M. S. Probabilistic and statistical properties of words: An overview. J. Comput. Bio. 7 (2000), 1--46.Google ScholarCross Ref
Stückle, E., Emmrich, C., Grob, U., and Nielsen, P. Statistical analysis of nucleotide sequences. Nucleic Acids Res. 18, 22 (1990), 6641--6647.Google ScholarCross Ref
Waterman, M. S. Introduction to Computational Biology. Chapman & Hall, 1995.Google ScholarCross Ref

Index Terms

Monotony of surprise and large-scale quest for unusual words
1. Applied computing
  1. Life and medical sciences
2. Mathematics of computing
  1. Probability and statistics

Recommendations

Space-Efficient Detection of Unusual Words
SPIRE 2015: Proceedings of the 22nd International Symposium on String Processing and Information Retrieval - Volume 9309

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-...
Read More
Verbumculus and the discovery of unusual words
Abstract
Measures relating word frequencies and expectations have been constantly of interest in Bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures have become conceivalbe, and yet pose ...
Read More
Pattern-avoiding alternating words

A word w = w 1 w 2 w n is alternating if either w 1 < w 2 w 3 < w 4 (when the word is up-down) or w 1 w 2 < w 3 w 4 < (when the word is down-up). In this paper, we initiate the study of (pattern-avoiding) alternating words. We enumerate up-down (...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
RECOMB '02: Proceedings of the sixth annual international conference on Computational biology
April 2002
341 pages
ISBN:1581134983
DOI:10.1145/565196
Editors:
Gene Myers
Celera, USA
,
Sridhar Hannenhalli
Celera, USA
,
David Sankoff
University of Montréal, Canada
,
Sorin Istrail
Celera, USA
,
Pavel Pevzner
University of California at San Diego, USA
,
Michael Waterman
University of California, USA
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 April 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
RECOMB '02 Paper Acceptance Rate35of118submissions,30%Overall Acceptance Rate148of538submissions,28%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 356
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Monotony of surprise and large-scale quest for unusual words

RECOMB '02: Proceedings of the sixth annual international conference on Computational biology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Space-Efficient Detection of Unusual Words

Verbumculus and the discovery of unusual words

Pattern-avoiding alternating words