Distributions of pattern statistics in sparse Markov models

Martin, Donald E. K.

doi:10.1007/s10463-019-00714-6

Published: 05 April 2019

Annals of the Institute of Statistical Mathematics, Volume 72, pages 895–913 (2020)

Donald E. K. Martin¹

Abstract

Markov models provide a good approximation to probabilities associated with many categorical time series, and thus they are applied extensively. However, a major drawback is that the number of model parameters grows exponentially in the order of the model, so only very low-order models are considered in applications. Another drawback is lack of flexibility: Markov models offer relatively few choices for the number of model parameters. Sparse Markov models are Markov models whose conditioning histories are grouped into classes such that the conditional probability distribution is the same for all members of a class. These models better handle the trade-off between the bias from having too few model parameters and the variance from having too many. In this paper, methodology for efficient computation of pattern distributions through Markov chains with minimal state spaces is extended to the sparse Markov framework.
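
To make the definitions in the abstract concrete, the following is a minimal Python sketch, not the paper's method. It assumes a two-letter alphabet, an order-2 model, and a hypothetical partition of the four conditioning histories into two classes with illustrative probabilities; all names (`CLASS_OF`, `CLASS_DIST`, `prob_word_occurs`) are invented for this example. The pattern probability is computed with a naive auxiliary chain whose state space is deliberately not minimal, so it only illustrates the idea of embedding a pattern statistic in a Markov chain.

```python
ALPHABET = ("a", "b")
ORDER = 2  # the model conditions on the last two symbols

# Hypothetical partition of the |A|^ORDER = 4 histories into 2 classes.
CLASS_OF = {("a", "a"): 0, ("a", "b"): 1, ("b", "a"): 1, ("b", "b"): 0}
# One conditional distribution per class (illustrative numbers only).
CLASS_DIST = {0: {"a": 0.7, "b": 0.3}, 1: {"a": 0.2, "b": 0.8}}


def next_symbol_dist(history):
    """Distribution of the next symbol given the last ORDER symbols."""
    return CLASS_DIST[CLASS_OF[tuple(history[-ORDER:])]]


def n_free_parameters(n_rows, alphabet_size=len(ALPHABET)):
    """Each row of conditional probabilities sums to one: |A| - 1 free values."""
    return n_rows * (alphabet_size - 1)


def prob_word_occurs(word, n, initial_history):
    """P(word occurs at least once among the n symbols after initial_history).

    Naive auxiliary chain: a state is (last ORDER symbols, last len(word) - 1
    generated symbols) for paths on which the word has not yet occurred.
    """
    word = tuple(word)
    keep = len(word) - 1
    alive = {(tuple(initial_history[-ORDER:]), ()): 1.0}
    hit = 0.0
    for _ in range(n):
        nxt = {}
        for (hist, tail), p in alive.items():
            for x, q in next_symbol_dist(hist).items():
                seen = tail + (x,)
                if seen[-len(word):] == word:
                    hit += p * q  # absorbed: the word has occurred
                else:
                    key = ((hist + (x,))[-ORDER:], seen[max(0, len(seen) - keep):])
                    nxt[key] = nxt.get(key, 0.0) + p * q
        alive = nxt
    return hit


# Full order-2 model: one row per history. Sparse model: one row per class.
print(n_free_parameters(len(CLASS_OF)), n_free_parameters(len(CLASS_DIST)))
print(prob_word_occurs(("a", "b"), 10, ("a", "a")))
```

In this toy setting the full order-2 model has one row of probabilities per history, whereas the sparse model has one row per class, which is where the reduction in parameters comes from. The dictionary of surviving states plays the role of the auxiliary chain; a minimal-state construction would merge states that lead to the same future behaviour.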

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1811933. The author would like to thank the reviewer for their insightful comments on the original version of the manuscript.

Author information

Authors and Affiliations

Department of Statistics, North Carolina State University, 4272 SAS Hall, 2311 Stinson Drive, Raleigh, NC, 27695-8203, USA
Donald E. K. Martin

Corresponding author

Correspondence to Donald E. K. Martin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Martin, D.E.K. Distributions of pattern statistics in sparse Markov models. Ann Inst Stat Math 72, 895–913 (2020). https://doi.org/10.1007/s10463-019-00714-6

Received: 28 July 2018
Revised: 27 January 2019
Published: 05 April 2019
Issue Date: August 2020
DOI: https://doi.org/10.1007/s10463-019-00714-6
