Distributions of pattern statistics in sparse Markov models

Annals of the Institute of Statistical Mathematics

Abstract

Markov models approximate the probabilities associated with many categorical time series well, and they are therefore applied extensively. A major drawback, however, is that the number of model parameters grows exponentially with the order of the model, so that only very low-order models are considered in applications. Another drawback is a lack of flexibility: Markov models offer relatively few choices for the number of model parameters. Sparse Markov models are Markov models whose conditioning histories are grouped into classes such that all histories in a class share the same conditional probability distribution. Such models handle the trade-off between the bias from having too few model parameters and the variance from having too many better than full Markov models do. In this paper, methodology for efficiently computing pattern distributions through Markov chains with minimal state spaces is extended to the sparse Markov framework.
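As a rough illustration of the ideas in the abstract, the following Python sketch counts free parameters for a full and a sparse Markov model, and computes the probability that a pattern occurs at least once under a sparse Markov model. It is not the paper's algorithm: it uses a naive embedding whose states pair the current history with the length of the matched pattern prefix, with no attempt to minimize the state space. All names (`n_free_params`, `next_match`, `prob_pattern_occurs`, `history_class`, `class_probs`) and the numerical values are hypothetical, and the sketch assumes a fixed initial history, an order of at least 1, and that only occurrences lying entirely within the newly generated symbols are counted.

```python
from collections import defaultdict


def n_free_params(alphabet_size, order, n_classes=None):
    """Free parameters of an order-`order` Markov model.

    With `n_classes` given, histories are assumed to be grouped into that
    many classes (a sparse Markov model), each with one distribution.
    """
    groups = alphabet_size ** order if n_classes is None else n_classes
    return groups * (alphabet_size - 1)


def next_match(pattern, k, symbol):
    """Longest prefix of `pattern` that is a suffix of pattern[:k] + symbol."""
    s = pattern[:k] + symbol
    for length in range(min(len(pattern), len(s)), 0, -1):
        if s.endswith(pattern[:length]):
            return length
    return 0


def prob_pattern_occurs(pattern, n, init_history, history_class, class_probs):
    """P(pattern occurs at least once among n symbols after init_history).

    State = (last m symbols, matched prefix length); probability mass that
    completes the pattern is moved to an absorbing total `hit`.
    """
    dist = {(init_history, 0): 1.0}
    hit = 0.0
    for _ in range(n):
        new = defaultdict(float)
        for (hist, k), p in dist.items():
            # The next-symbol distribution depends only on the history's class.
            for sym, q in class_probs[history_class[hist]].items():
                k2 = next_match(pattern, k, sym)
                if k2 == len(pattern):
                    hit += p * q
                else:
                    new[(hist[1:] + sym, k2)] += p * q
        dist = new
    return hit


# Hypothetical order-2 sparse model on the alphabet {a, b}: the four
# histories are grouped into two classes instead of four.
history_class = {"aa": 0, "bb": 0, "ab": 1, "ba": 1}
class_probs = {0: {"a": 0.7, "b": 0.3}, 1: {"a": 0.4, "b": 0.6}}

print(n_free_params(2, 2))      # full order-2 model: 4 free parameters
print(n_free_params(2, 2, 2))   # sparse model with 2 classes: 2 free parameters
print(prob_pattern_occurs("aba", 10, "aa", history_class, class_probs))
```

The sparse model here halves the parameter count of the full order-2 model while still conditioning on two past symbols. The state space of this naive chain grows with both the number of histories and the pattern length, which is the cost that a minimal-state-space construction is meant to reduce.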

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1811933. The author would like to thank the reviewer for their insightful comments on the original version of the manuscript.

Author information

Corresponding author

Correspondence to Donald E. K. Martin.

About this article

Cite this article

Martin, D.E.K. Distributions of pattern statistics in sparse Markov models. Ann Inst Stat Math 72, 895–913 (2020). https://doi.org/10.1007/s10463-019-00714-6

  • DOI: https://doi.org/10.1007/s10463-019-00714-6
