Abstract
String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. Existing approaches use a uniform, predefined similarity threshold. In real applications, however, longer string pairs typically tolerate more typos, so it is necessary to apply variable thresholds to different strings rather than a constant one. We therefore propose a solution for string similarity joins with different similarity thresholds in a single procedure. To support different similarity thresholds, we devise a similarity-aware index and an index probing technique. To the best of our knowledge, this is the first work to address the problem. Experimental results on real-world datasets show that our solution handles different similarity thresholds efficiently.
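To make the problem setting concrete, the following is a minimal sketch of a join in which the allowed number of edits depends on string length. It is a naive nested-loop baseline, not the similarity-aware index or probing technique described in the abstract. The choice of edit distance as the similarity function, the rule of one allowed edit per five characters (`length_threshold`), and taking the smaller of the two per-string thresholds for a pair are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Callable, List, Tuple


def edit_distance_within(a: str, b: str, tau: int) -> bool:
    """Return True if the Levenshtein distance between a and b is at most tau."""
    # Length filter: the distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        # Row minima never decrease, so the final distance cannot drop below this.
        if min(cur) > tau:
            return False
        prev = cur
    return prev[-1] <= tau


def length_threshold(s: str) -> int:
    """Hypothetical rule (an assumption): allow one edit per five characters."""
    return len(s) // 5


def variable_threshold_join(
    r: List[str],
    s: List[str],
    threshold: Callable[[str], int] = length_threshold,
) -> List[Tuple[str, str, int]]:
    """Naive join: compare every pair under a pair-specific threshold."""
    result = []
    for x in r:
        for y in s:
            # Assumption: a pair must satisfy the stricter of the two thresholds.
            tau = min(threshold(x), threshold(y))
            if edit_distance_within(x, y, tau):
                result.append((x, y, tau))
    return result
```

Under these assumptions, a short pair such as "data" and "date" gets a threshold of 0 and is rejected, while a longer pair such as "similarity join" and "similarty joins" gets a threshold of 3 and is accepted. A uniform threshold cannot express both outcomes at once, which is the motivation the abstract gives for variable thresholds.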
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Rong, C., Zhang, X. (2015). String Similarity Join with Different Thresholds. In: Zhang, S., Wirsing, M., Zhang, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2015. Lecture Notes in Computer Science, vol 9403. Springer, Cham. https://doi.org/10.1007/978-3-319-25159-2_24
DOI: https://doi.org/10.1007/978-3-319-25159-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25158-5
Online ISBN: 978-3-319-25159-2
eBook Packages: Computer Science (R0)