Efficient top-K approximate searches against a relation with multiple attributes

World Wide Web

Abstract

In this paper, we study the problem of efficiently identifying the K records that are most similar to a given query record. Similarity is defined in two steps: (1) for each record, we compute a similarity score between the record and the query record over each individual attribute, using a specific similarity function; (2) an aggregate function combines these per-attribute scores with weights, and the aggregated value serves as the similarity of the record. Once the similarities of all records have been computed, the K records with the greatest similarities can be identified. Under this framework, however, the computational cost becomes extremely high when the cardinality of the relation is large, because a similarity must be computed for every record. We therefore propose two efficient algorithms, named ScanIndex and Top-Down (TD for short), to cope with this problem. ScanIndex avoids computing similarity scores over individual attributes that are equal to zero. TD builds on ScanIndex and skips similarity scores over individual attributes that are less than given thresholds (rather than only those equal to zero), where these thresholds are improved dynamically over time. Experimental results demonstrate that, compared with the naive approach, ScanIndex and TD can improve performance by two orders of magnitude.
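
To make the framework in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes token-level Jaccard similarity for every attribute and a weighted sum as the aggregate function; the abstract fixes neither choice. The names `naive_top_k`, `build_index`, and `indexed_top_k` are hypothetical. The indexed variant only illustrates the general idea of never computing per-attribute scores that would be zero, which the abstract attributes to ScanIndex; the dynamic thresholds of TD are not shown.

```python
import heapq
from collections import defaultdict


def tokens(value):
    """Split an attribute value into a set of lowercase word tokens."""
    return set(value.lower().split())


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed per-attribute function)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank(scored, k):
    """Return the k best (score, record_id) pairs; ties favor smaller ids."""
    return heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))


def naive_top_k(records, query, weights, k):
    """Baseline: score every record over every attribute, then pick top K."""
    scored = []
    for rid, rec in enumerate(records):
        total = 0.0
        for w, value, q in zip(weights, rec, query):
            total += w * jaccard(value, q)
        scored.append((total, rid))
    return rank(scored, k)


def build_index(records):
    """One inverted index per attribute: token -> set of record ids."""
    index = [defaultdict(set) for _ in range(len(records[0]))]
    for rid, rec in enumerate(records):
        for attr, value in enumerate(rec):
            for tok in tokens(value):
                index[attr][tok].add(rid)
    return index


def indexed_top_k(records, index, query, weights, k):
    """Score only records sharing at least one token with the query on an
    attribute; all other per-attribute scores are zero and are skipped.
    Records whose total score is zero are never returned."""
    totals = defaultdict(float)
    for attr, (w, q) in enumerate(zip(weights, query)):
        candidates = set()
        for tok in tokens(q):
            candidates |= index[attr].get(tok, set())
        for rid in candidates:
            totals[rid] += w * jaccard(records[rid][attr], q)
    return rank([(s, rid) for rid, s in totals.items()], k)


# Toy relation with two attributes (name, city); the data is illustrative.
records = [
    ("john smith", "new york"),
    ("jon smith", "new york"),
    ("john smyth", "boston"),
    ("alice jones", "chicago"),
]
query = ("john smith", "new york")
weights = (0.7, 0.3)
index = build_index(records)
top_naive = naive_top_k(records, query, weights, 2)
top_indexed = indexed_top_k(records, index, query, weights, 2)
```

On this toy relation both functions return the same two records, but the indexed variant never touches the fourth record, which shares no token with the query on any attribute. The two functions differ only when fewer than K records have a nonzero similarity: the baseline pads the result with zero-score records, while the indexed variant returns a shorter list.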

Corresponding author

Correspondence to Xiaoyong Du.

About this article

Cite this article

Lu, W., Chen, J., Du, X. et al. Efficient top-K approximate searches against a relation with multiple attributes. World Wide Web 14, 573–597 (2011). https://doi.org/10.1007/s11280-011-0137-1

  • DOI: https://doi.org/10.1007/s11280-011-0137-1
