Active Duplicate Detection

Deng, Ke; Wang, Liwei; Zhou, Xiaofang; Sadiq, Shazia; Fung, Gabriel Pui Cheong

doi:10.1007/978-3-642-12026-8_43

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5981)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

The aim of duplicate detection is to group records in a relation that refer to the same real-world entity, such as a person or a business. Most existing methods require user-specified parameters, such as a similarity threshold, before duplicate detection can be conducted. We call these methods user-first. In many scenarios, however, it is very hard for the user to specify such parameters in advance, and the values chosen are often unreliable, which limits the applicability of user-first methods. In this paper, we propose a user-last method, called Active Duplicate Detection (ADD), which returns an initial solution without forcing the user to specify such parameters and then involves the user in refining that solution. Unlike user-first methods, where the user makes decisions before any processing, ADD lets the user make decisions based on an initial solution. The initial solution identified by ADD is of comparatively high quality and can be refined systematically at almost zero cost.
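
To make the user-first setting concrete, the following is a minimal sketch of threshold-based duplicate grouping. It is not the ADD algorithm, whose details are not given in the abstract. The token-set Jaccard similarity, the sample records, and the names `jaccard` and `group_duplicates` are all assumptions chosen for illustration. The sketch links any two records whose similarity reaches a user-supplied threshold and reports the connected groups.

```python
from itertools import combinations

# Hypothetical sample records, invented for illustration only.
RECORDS = [
    "john smith sydney",
    "john smith brisbane",
    "jon smith sydney",
    "acme pty ltd",
    "acme ltd",
]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_duplicates(records, threshold):
    """User-first grouping: the threshold must be chosen before processing.

    Records whose pairwise similarity is >= threshold are linked, and the
    connected components (found with union-find) are returned as sorted
    lists of record indices.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    for t in (0.5, 0.6, 0.7):
        print(t, group_duplicates(RECORDS, t))
```

On these sample records, a threshold of 0.5 merges the three person records into one group, 0.6 keeps only the two company records together, and 0.7 finds no duplicates at all. This sensitivity to a value fixed before any result is visible is the difficulty the abstract attributes to user-first methods.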

Author information

Authors and Affiliations

The University of Queensland, Australia
Ke Deng, Xiaofang Zhou, Shazia Sadiq & Gabriel Pui Cheong Fung
Wuhan University, China
Liwei Wang

Editor information

Editors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, Tennodai, Tsukuba, 305-8573, Ibaraki, Japan
Hiroyuki Kitagawa
Information Technology Center, Nagoya University, Furo-cho, Chikusa-ku, 464-8601, Nagoya, Japan
Yoshiharu Ishikawa
City University of Hong Kong, Department of Computer Science, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Department of Information Science, Ochanomizu University, 2-1-1, Otsuka, Bunkyo-ku, 112-8610, Tokyo, Japan
Chiemi Watanabe

About this paper

Cite this paper

Deng, K., Wang, L., Zhou, X., Sadiq, S., Fung, G.P.C. (2010). Active Duplicate Detection. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 5981. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12026-8_43

DOI: https://doi.org/10.1007/978-3-642-12026-8_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12025-1
Online ISBN: 978-3-642-12026-8
eBook Packages: Computer Science, Computer Science (R0)
