Efficient top-K approximate searches against a relation with multiple attributes

Lu, Wei; Chen, Jinchuan; Du, Xiaoyong; Wang, Jieping; Pan, Wei

doi:10.1007/s11280-011-0137-1

Published: 12 August 2011

Volume 14, pages 573–597 (2011)

World Wide Web

Wei Lu^1,2, Jinchuan Chen^2, Xiaoyong Du^1,2, Jieping Wang^3 & Wei Pan^4

Abstract

In this paper, we study the problem of efficiently identifying the K records that are most similar to a given query record, where similarity is defined as follows: (1) for each record, we calculate the similarity score between the record and the query record over each individual attribute using a specific similarity function; (2) an aggregate function combines these similarity scores with weights, and the aggregated value serves as the similarity of the record. Once the similarities of all records have been computed, the K records with the greatest similarities can be identified. Under this framework, however, the computational cost becomes extremely high when the cardinality of the relation is large, because a similarity must be computed for every record. We therefore propose two efficient algorithms, named ScanIndex and Top-Down (TD for short), to cope with this problem. ScanIndex avoids computing similarity scores that are equal to zero over individual attributes. TD builds on ScanIndex and skips similarity scores that are less than thresholds (rather than zero) over individual attributes, where these thresholds are improved dynamically over time. Experimental results demonstrate that, compared with the naive approach, ScanIndex and TD improve performance by two orders of magnitude.
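
The framework described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the similarity or aggregate functions, so the sketch assumes Jaccard similarity over whitespace tokens and a weighted sum, and all function names (`jaccard`, `naive_top_k`, `build_index`, `indexed_top_k`) are hypothetical. The second search function only illustrates the idea attributed to ScanIndex, namely never computing per-attribute scores that are zero, by using one inverted index per attribute; the threshold-based pruning of TD is not shown.

```python
import heapq
from collections import defaultdict


def tokens(value):
    """Lower-cased whitespace tokens of an attribute value."""
    return set(value.lower().split())


def jaccard(a, b):
    """Assumed per-attribute similarity: Jaccard over token sets."""
    sa, sb = tokens(a), tokens(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def naive_top_k(relation, query, weights, k):
    """Naive approach: score every record on every attribute, keep the best k."""
    scored = []
    for rid, record in enumerate(relation):
        # Assumed aggregate function: weighted sum of per-attribute scores.
        score = sum(w * jaccard(r, q) for r, q, w in zip(record, query, weights))
        scored.append((score, rid))
    # Ties are broken in favour of the smaller record id.
    return heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))


def build_index(relation):
    """One inverted index per attribute: token -> ids of records containing it."""
    n_attrs = len(relation[0]) if relation else 0
    index = [defaultdict(set) for _ in range(n_attrs)]
    for rid, record in enumerate(relation):
        for attr, value in enumerate(record):
            for token in tokens(value):
                index[attr][token].add(rid)
    return index


def indexed_top_k(relation, index, query, weights, k):
    """Score only (record, attribute) pairs that share a token with the query.

    Pairs that share no token have Jaccard similarity zero, so skipping
    them does not change any record's aggregated score.
    """
    scores = defaultdict(float)
    for attr, (q, w) in enumerate(zip(query, weights)):
        candidates = set()
        for token in tokens(q):
            candidates |= index[attr].get(token, set())
        for rid in candidates:
            scores[rid] += w * jaccard(relation[rid][attr], q)
    scored = [(score, rid) for rid, score in scores.items()]
    return heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))


# Toy relation with three attributes (made-up data for illustration only).
relation = [
    ("john smith", "new york", "database systems"),
    ("jon smith", "new york city", "data systems"),
    ("mary jones", "boston", "machine learning"),
    ("john smyth", "york", "database"),
]
query = ("john smith", "new york", "database")
weights = (0.5, 0.3, 0.2)

index = build_index(relation)
naive_result = naive_top_k(relation, query, weights, k=2)
indexed_result = indexed_top_k(relation, index, query, weights, k=2)
```

On this toy relation both functions return records 0 and 3, in that order. The two agree whenever at least k records have a non-zero aggregated similarity; records whose similarity is zero on every attribute never enter the indexed version's candidate set, so it can return fewer than k results where the naive version would pad the answer with zero-score records.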

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, 100872, China
Wei Lu & Xiaoyong Du
Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing, China
Wei Lu, Jinchuan Chen & Xiaoyong Du
China Electronics Standardization Institute, Beijing, China
Jieping Wang
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
Wei Pan

Corresponding author

Correspondence to Xiaoyong Du.

About this article

Cite this article

Lu, W., Chen, J., Du, X. et al. Efficient top-K approximate searches against a relation with multiple attributes. World Wide Web 14, 573–597 (2011). https://doi.org/10.1007/s11280-011-0137-1

Received: 22 April 2010
Revised: 13 June 2011
Accepted: 14 June 2011
Published: 12 August 2011
Issue Date: October 2011
DOI: https://doi.org/10.1007/s11280-011-0137-1
