Abstract
In this paper we present several solutions to the problem of indexing a dictionary of strings in compressed space. Given a pattern \(P\), the index has to report all the strings in the dictionary at edit distance at most one from \(P\). Our first solution answers queries in almost optimal \(O(|P|+occ)\) time, where \(occ\) is the number of strings in the dictionary at edit distance at most one from \(P\). The space complexity of this solution is bounded in terms of the \(k\)th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of a higher query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) that simultaneously achieve the time complexity of the first solution and the space complexity of the second one.
Notes
However, they can be easily extended to deal with the more general edit distance.
Actually, the paper [11] describes a solution only for a binary alphabet. However, it is not hard to obtain the claimed space and time complexities for non-constant alphabet sizes as well.
Notice that just accessing each symbol of these candidate strings would cost \(O(p + p \cdot occ)\) time in total, which is much higher than our claimed complexity.
Recall that we are still assuming that we can check in \(O(1)\) time whether a candidate string belongs to \( D \).
Observe that similar considerations hold also for substitutions with the difference that we skip the \(i\)th symbol in factorizations of the form \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\).
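The factorization \(P=P[1,i-1] \cdot P[i] \cdot P[i+1, p]\) can be illustrated with a brute-force sketch. It is not the compressed index of the paper: it simply enumerates every candidate string at distance one and tests membership. The function name `within_one_edit`, the explicit `alphabet` argument, and the use of a Python `set` as a stand-in for the constant-time membership check are all assumptions made for illustration. Note that hashing each candidate costs time proportional to its length here, which is exactly the per-candidate cost the paper's solutions avoid.

```python
def within_one_edit(pattern, dictionary, alphabet):
    """Return the strings of `dictionary` at edit distance at most one from `pattern`.

    Brute-force illustration only: `dictionary` is a plain Python set used as a
    stand-in for the membership test assumed in the notes (hypothetical setup).
    """
    found = set()
    if pattern in dictionary:            # distance zero
        found.add(pattern)

    p = len(pattern)
    for i in range(p):
        # Factorization P = prefix . P[i] . suffix (0-based index i here).
        prefix, suffix = pattern[:i], pattern[i + 1:]

        # Deletion: skip the i-th symbol.
        candidate = prefix + suffix
        if candidate in dictionary:
            found.add(candidate)

        # Substitution: replace the i-th symbol with a different one.
        for c in alphabet:
            if c != pattern[i]:
                candidate = prefix + c + suffix
                if candidate in dictionary:
                    found.add(candidate)

    # Insertion: place a new symbol at any of the p + 1 gaps.
    for i in range(p + 1):
        for c in alphabet:
            candidate = pattern[:i] + c + pattern[i:]
            if candidate in dictionary:
                found.add(candidate)

    return sorted(found)


words = {"cat", "cart", "at", "cut", "dog", "cats"}
matches = within_one_edit("cat", words, "abcdefghijklmnopqrstuvwxyz")
```

With this toy dictionary, the query for `"cat"` finds the exact match plus one deletion (`"at"`), one substitution (`"cut"`), and two insertions (`"cart"`, `"cats"`).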
Checks for other types of errors are done in a similar way.
Note that the number of distinct lengths, and thus of compressed permuterm indexes, is \(O(\sqrt{n})\).
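The \(O(\sqrt{n})\) bound follows from a short counting argument, sketched here under the assumption that \(n\) denotes the total length of the strings in \( D \) and that every string is non-empty (the symbols \(k\) and \(\ell_j\) below are introduced only for this sketch). Suppose the strings in \( D \) have \(k\) distinct lengths \(\ell_1< \ell_2< \cdots < \ell_k\). Each length is realized by at least one string and \(\ell_j \ge j\), so

\[
n \;\ge\; \sum_{j=1}^{k} \ell_j \;\ge\; \sum_{j=1}^{k} j \;=\; \frac{k(k+1)}{2} \;>\; \frac{k^2}{2},
\]

which gives \(k < \sqrt{2n}\), that is, \(k = O(\sqrt{n})\).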
Notice that this case occurs only when \(P_iP[i]\in D \). In order to properly deal with this case, the value of \(\ell \) is not increased after a successful backward step if it has already reached the maximal value \(p+2\).
If we have false positives, then the same character may be checked, and thus potentially reported, twice. To avoid this, we can use a dynamic hash table at query time that stores all the characters reported so far. Whenever we find that a character has already been reported, the query stops and reports no more characters, since a correct query answer cannot return the same character twice at the same position.
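The stopping rule described in this note can be sketched as follows. This is a minimal illustration, not the paper's data structure: the input `candidate_chars` stands for the stream of characters produced for one position by a possibly erroneous query (a hypothetical interface), and a Python `set` plays the role of the dynamic hash table.

```python
def report_until_repeat(candidate_chars):
    """Report characters in order, stopping at the first repeated one.

    `candidate_chars` is a hypothetical stream of characters proposed for a
    single position; a repeat can only be caused by a false positive, so the
    query is aborted as soon as one is seen.
    """
    seen = set()          # characters reported so far (the "dynamic hash table")
    reported = []
    for c in candidate_chars:
        if c in seen:     # already reported: stop and report nothing further
            break
        seen.add(c)
        reported.append(c)
    return reported


clean = report_until_repeat(["a", "c", "t"])
with_repeat = report_until_repeat(["a", "c", "a", "t"])
```

In the second call the repeated `"a"` aborts the scan, so `"t"` is never reported. Because each lookup and insertion takes expected constant time, the check adds only constant expected overhead per reported character.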
References
Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Text indexing and dictionary matching with one error. J. Algorithms 37(2), 309–325 (2000)
Barbay, J., He, M., Munro, J.I., Satti, S.R.: Succinct indexes for strings, binary relations and multilabeled trees. ACM Trans. Algorithms 7(4), 52 (2011)
Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)
Belazzougui, D.: Faster and space-optimal edit distance “1” dictionary. In: Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 154–167 (2009)
Belazzougui, D.: Improved space-time tradeoffs for approximate full-text indexing with one edit error. Algorithmica (2014). doi:10.1007/s00453-014-9873-9
Belazzougui, D., Navarro, G.: New lower and upper bounds for representing sequences. In: Proceedings of the 20th Annual European Symposium on Algorithms (ESA), pp. 181–192 (2012)
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), 23 (2014)
Belazzougui, D., Venturini, R.: Compressed string dictionary look-up with edit distance one. In: Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 280–292 (2012)
Belazzougui, D., Venturini, R.: Compressed static functions with applications. In: Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 229–240 (2013)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Brodal, G.S., Gąsieniec, L.: Approximate dictionary queries. In: Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, pp. 65–74. Springer (1996)
Brodal, G.S., Srinivasan, V.: Improved bounds for dictionary look-up with one error. Inf. Process. Lett. 75(1–2), 57–59 (2000)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Chazelle, B., Kilian, J., Rubinfeld, R., Tal, A.: The Bloomier filter: an efficient data structure for static support lookup tables. In: Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 30–39 (2004)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)
Dietzfelbinger, M., Gil, J., Matias, Y., Pippenger, N.: Polynomial hash functions are reliable (extended abstract). In: Proceedings of the 19th International Colloquium on Automata, Languages and Programming (ICALP), pp. 235–246 (1992)
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21, 246–260 (1974)
Fano, R.M.: On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, Project MAC (1971)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 12 (2008)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1), 115–121 (2007)
Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10 (2010)
Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
Hagerup, T., Tholey, T.: Efficient minimal perfect hashing in nearly minimal space. In: Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pp. 317–326 (2001)
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)
Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)
Munro, J.I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31(3), 762–776 (2001)
Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Comput. Surv. 39(1), 2 (2007)
Ohlebusch, E., Gog, S., Kügel, A.: Computing matching statistics and maximal exact matches on compressed full-text indexes. In: SPIRE, pp. 347–358 (2010)
Pagh, A., Pagh, R., Rao, S.S.: An optimal Bloom filter replacement. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 823–829 (2005)
Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)
Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Burlington (1999)
Yao, A.C.-C., Yao, F.F.: Dictionary look-up with one error. J. Algorithms 25(1), 194–202 (1997)
Additional information
This work is an extended version of the paper [8] that appeared in the Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, 2012. This work has been partially supported by the Academy of Finland under Grant 250345 (CoECGR), the French ANR-2010-COSI-004 MAPPI project, PRIN ARS Technomedia 2012, the Midas EU Project, Grant Agreement No. 318786, and the eCloud EU Project, Grant Agreement No. 325091.
Belazzougui, D., Venturini, R. Compressed String Dictionary Search with Edit Distance One. Algorithmica 74, 1099–1122 (2016). https://doi.org/10.1007/s00453-015-9990-0
DOI: https://doi.org/10.1007/s00453-015-9990-0