Compressed String Dictionary Search with Edit Distance One

Belazzougui, Djamal; Venturini, Rossano

doi:10.1007/s00453-015-9990-0

Published: 25 March 2015

Volume 74, pages 1099–1122 (2016)

Algorithmica

Abstract

In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern \(P\), the index has to report all the strings in the dictionary having edit distance at most one with \(P\). Our first solution is able to solve queries in (almost optimal) \(O(|P|+occ)\) time where \(occ\) is the number of strings in the dictionary having edit distance at most one with \(P\). The space complexity of this solution is bounded in terms of the \(k\)th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of increasing the query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) which achieve simultaneously the time complexity of the first solution and the space complexity of the second one.
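To make the query semantics concrete, here is a minimal brute-force sketch of what such an index must report. It is not the paper's method: it scans the whole dictionary, so it runs in time proportional to the total dictionary length rather than \(O(|P|+occ)\), and it uses no compression. The function names `edit_distance_at_most_one` and `naive_query` are illustrative assumptions.

```python
def edit_distance_at_most_one(a: str, b: str) -> bool:
    """Return True if a and b differ by at most one insertion,
    deletion, or substitution of a single symbol."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure len(a) <= len(b)
    # Skip the longest common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Equal strings or one substitution at position i:
        # everything after position i must match.
        return a[i + 1:] == b[i + 1:]
    # b has one extra symbol at position i.
    return a[i:] == b[i + 1:]


def naive_query(dictionary, pattern):
    """Baseline only: scan every string and keep those within distance one."""
    return sorted(s for s in dictionary if edit_distance_at_most_one(s, pattern))


if __name__ == "__main__":
    D = ["cat", "cart", "cut", "cats", "dog", "at", "coat", "ca"]
    print(naive_query(D, "cat"))
```

Such a baseline is mainly useful as a correctness oracle when testing a faster index on small inputs.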

Notes

However, they can be easily extended to deal with the more general edit distance.
Actually, the paper [11] described only a solution for a binary alphabet. However, it is not hard to obtain the claimed space and time complexities also for non-constant alphabet sizes.
Notice that just accessing each symbol of these candidate strings would cost \(O(p + p \cdot occ)\) time in total, which is much higher than our claimed complexity.
Recall that we are still assuming that we can check in \(O(1)\) whether a candidate string belongs to \( D \).
Observe that similar considerations hold also for substitutions with the difference that we skip the \(i\)th symbol in factorizations of the form \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\).
Checks for other types of errors are done in a similar way.
We notice that the number of distinct lengths and, thus, of compressed permuterm indexes is \(O(\sqrt{n})\).
Notice that this case occurs only when \(P_iP[i]\in D \). In order to properly deal with this case, the value of \(\ell \) is not increased after a successful backward step if it already reached the maximal value \(p+2\).
If we have false positives, then the same character may be checked and thus potentially reported twice. To avoid this case, we can use a dynamic hash table at query time that stores all the characters reported so far. Whenever we find that a character has already been reported, the query stops and does not report more characters, since a correct query answer cannot return the same character twice at the same position.
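The notes above refer to factorizations of the form \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\) and to the cost of touching every symbol of every candidate. One standard way to avoid rescanning a candidate is to combine precomputed fingerprints of the prefix and suffix of \(P\) in constant time. The sketch below illustrates that idea with a polynomial (Karp-Rabin style) fingerprint. It is an assumption-laden illustration, not the construction from the paper: the modulus, the base, and the class `PatternFingerprints` with its method names are all hypothetical, and fingerprints can collide, so a match is only probabilistic.

```python
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)
BASE = 911382323      # assumed, arbitrary base


def fingerprint(s: str) -> int:
    """Polynomial fingerprint of s, computed symbol by symbol."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch) + 1) % MOD
    return h


class PatternFingerprints:
    """After O(p) preprocessing of a pattern of length p, returns the
    fingerprint of any single-edit variant in O(1) arithmetic operations."""

    def __init__(self, pattern: str):
        self.p = len(pattern)
        # prefix[i] is the fingerprint of pattern[:i].
        self.prefix = [0] * (self.p + 1)
        for i, ch in enumerate(pattern):
            self.prefix[i + 1] = (self.prefix[i] * BASE + ord(ch) + 1) % MOD
        # power[k] is BASE**k modulo MOD.
        self.power = [1] * (self.p + 2)
        for k in range(1, self.p + 2):
            self.power[k] = self.power[k - 1] * BASE % MOD

    def _suffix(self, i: int) -> int:
        """Fingerprint of pattern[i:], derived from prefix fingerprints."""
        return (self.prefix[self.p] - self.prefix[i] * self.power[self.p - i]) % MOD

    def with_deletion(self, i: int) -> int:
        """Fingerprint of pattern[:i] + pattern[i+1:], for 0 <= i < p."""
        tail_len = self.p - i - 1
        return (self.prefix[i] * self.power[tail_len] + self._suffix(i + 1)) % MOD

    def with_substitution(self, i: int, c: str) -> int:
        """Fingerprint of pattern[:i] + c + pattern[i+1:], for 0 <= i < p."""
        tail_len = self.p - i - 1
        head = (self.prefix[i] * BASE + ord(c) + 1) % MOD
        return (head * self.power[tail_len] + self._suffix(i + 1)) % MOD

    def with_insertion(self, i: int, c: str) -> int:
        """Fingerprint of pattern[:i] + c + pattern[i:], for 0 <= i <= p."""
        tail_len = self.p - i
        head = (self.prefix[i] * BASE + ord(c) + 1) % MOD
        return (head * self.power[tail_len] + self._suffix(i)) % MOD
```

For example, `PatternFingerprints("cat").with_substitution(1, "u")` equals `fingerprint("cut")` without rescanning the candidate. Enumerating every symbol of the alphabet at every position this way would still cost time proportional to \(p\) times the alphabet size, which is exactly the kind of overhead a more careful index has to avoid.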

References

Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)
Barbay, J., He, M., Munro, J.I., Satti, S.R.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)
Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)
Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 154–167 (2009)
Belazzougui, D.: Improved space-time tradeoffs for approximate full-text indexing with one edit error. Algorithmica (2014). doi:10.1007/s00453-014-9873-9
Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA), pp. 181–192 (2012)
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), 23 (2014)
Belazzougui, D., Venturini, R.: Compressed string dictionary look-up with edit distance one. In: Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 280–292 (2012)
Belazzougui, D., Venturini, R.: Compressed static functions with applications. In: Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 229–240 (2013)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Brodal, G.S., Gąsieniec, L.: Approximate dictionary queries. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, pp. 65–74. Springer (1996)
Brodal, G.S., Srinivasan, V.: Improved bounds for dictionary look-up with one error. Inf. Process. Lett. 75(1–2), 57–59 (2000)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Chazelle, B., Kilian, J., Rubinfeld, R., Tal, A.: The Bloomier filter: an efficient data structure for static support lookup tables. In: Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 30–39 (2004)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)
Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial hash functions are reliable (extended abstract). In: Proceedings of the 19th International Colloquium on Automata, Languages and Programming (ICALP), pp. 235–246 (1992)
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21, 246–260 (1974)
Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC (1971)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 12 (2008)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)
Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10 (2010)
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
Hagerup, T., Tholey, T.: Efficient minimal perfect hashing in nearly minimal space. In: Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pp. 317–326 (2001)
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)
Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31(3), 762–776 (2001)
Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Comput. Surv. 39(1), 2 (2007)
Ohlebusch, E., Gog, S., Kügel, A.: Computing matching statistics and maximal exact matches on compressed full-text indexes. In: SPIRE, pp. 347–358 (2010)
Pagh, A., Pagh, R., Rao, S.S.: An optimal bloom filter replacement. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 823–829 (2005)
Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Burlington (1999)
Yao, A.C.-C., Yao, F.F.: Dictionary look-up with one error. J. Algorithms 25(1), 194–202 (1997)

Author information

Authors and Affiliations

Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
Djamal Belazzougui
Department of Computer Science, University of Pisa, Pisa, Italy
Rossano Venturini

Corresponding author

Correspondence to Rossano Venturini.

Additional information

This work is an extended version of the paper [8], which appeared in the Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, 2012. It has been partially supported by the Academy of Finland under Grant 250345 (CoECGR), the French ANR-2010-COSI-004 MAPPI project, PRIN ARS Technomedia 2012, the Midas EU Project, Grant Agreement No. 318786, and the eCloud EU Project, Grant Agreement No. 325091.

About this article

Cite this article

Belazzougui, D., Venturini, R. Compressed String Dictionary Search with Edit Distance One. Algorithmica 74, 1099–1122 (2016). https://doi.org/10.1007/s00453-015-9990-0

Received: 11 August 2013
Accepted: 17 March 2015
Published: 25 March 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s00453-015-9990-0
