research-article

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

Published: 05 March 2021

Abstract

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated applications like bioinformatics, the string collections experienced a growth that outpaces Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this two-part survey, we cover the algorithmic developments that have led to these data structures.

In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations between those measures, as well as the limits up to which the data can be recovered, and up to which direct access to the compressed data can be provided. This is the basic aspect of indexability, which is covered in the second part of this survey.



Published in

ACM Computing Surveys, Volume 54, Issue 2
March 2022
800 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3450359

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 5 March 2021
• Revised: 1 November 2020
• Accepted: 1 November 2020
• Received: 1 April 2020


Qualifiers

• research-article
• Research
• Refereed
