Abstract
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while still offering indexed search functionality. As this technology spread through applications such as bioinformatics, string collections grew at a rate that outpaces Moore’s Law and challenges our ability to handle them even in compressed form. Fortunately, many of these rapidly growing collections are highly repetitive, so their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, so a new set of techniques has been developed to exploit it properly. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections we now face. In this two-part survey, we cover the algorithmic developments that have led to these data structures.
In this first part, we describe the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings. In the quest for an ideal measure of repetitiveness, we uncover a fascinating web of relations among these measures, as well as the limits within which the data can be recovered and within which direct access to the compressed data can be provided. Direct access is the basic aspect of indexability, which is covered in the second part of this survey.
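As a rough illustration of why statistical compression is blind to repetitiveness, the sketch below compares the zero-order empirical entropy of a string with the number of phrases in a Lempel-Ziv style parse, both for a short string and for several concatenated copies of it. The function names and the test string are hypothetical choices made for this example, and the parser is a simple quadratic-time variant assumed here for clarity (each phrase is the longest prefix of the remaining text that also starts at an earlier position, or a single new character). It is not the survey's own implementation.

```python
import math
from collections import Counter


def empirical_entropy_h0(text: str) -> float:
    """Zero-order empirical entropy of text, in bits per symbol."""
    n = len(text)
    counts = Counter(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def lz77_phrase_count(text: str) -> int:
    """Number of phrases in a greedy Lempel-Ziv parse (illustrative variant).

    Each phrase is the longest prefix of the unparsed suffix that also occurs
    starting at an earlier position (the source may overlap the phrase), or a
    single literal character when no such occurrence exists.
    Quadratic-time brute force, intended only for small inputs.
    """
    n = len(text)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # literal character if nothing matched
        phrases += 1
    return phrases


base = "abracadabra"        # hypothetical example string
collection = base * 8       # a highly repetitive "collection"

for name, s in (("base", base), ("collection", collection)):
    print(name, len(s), round(empirical_entropy_h0(s), 4), lz77_phrase_count(s))
```

Under these assumptions, the entropy per symbol is the same for both inputs, so an entropy-bounded encoding grows linearly with the number of copies, whereas the phrase count grows by at most one.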
Index Terms
- Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures