Active Duplicate Detection

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5981)

Abstract

The aim of duplicate detection is to group records in a relation that refer to the same real-world entity, such as a person or a business. Most existing approaches require user-specified parameters, such as a similarity threshold, before duplicate detection can be carried out. We call these methods user-first. In many scenarios, however, it is hard for the user to specify such parameters in advance, and the values chosen are often unreliable, which limits the applicability of user-first methods. In this paper, we propose a user-last method, called Active Duplicate Detection (ADD): an initial solution is returned without requiring the user to specify such parameters, and the user is then involved in refining it. Unlike user-first methods, where the user makes decisions before any processing, ADD lets the user make decisions based on an initial solution. The initial solution identified by ADD is of comparatively high quality and can be refined systematically at almost zero cost.
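
To make the user-first setting concrete, the following minimal Python sketch groups records whose pairwise string similarity reaches a user-chosen threshold. It is an illustration only and is not the ADD method, whose details are not given in the abstract. The similarity measure (`difflib.SequenceMatcher`), the function names `similarity` and `group_duplicates`, and the sample records are all assumptions introduced here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1] (an assumption, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(records: list[str], threshold: float) -> list[list[str]]:
    """User-first grouping: link records whose similarity reaches the given threshold."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair and merge the groups of sufficiently similar records.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    # Collect records by group root, preserving input order.
    groups: dict[int, list[str]] = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical sample records.
    records = ["John Smith", "Jon Smith", "Acme Corp", "Acme Corporation", "Zeta Ltd"]
    for threshold in (0.9, 0.7):
        print(threshold, group_duplicates(records, threshold))
```

On these sample records, a threshold of 0.9 links only the two "Smith" records, while 0.7 also links the two "Acme" records. The grouping therefore depends on a value the user must pick before seeing any result, which is the difficulty the abstract attributes to user-first methods.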

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deng, K., Wang, L., Zhou, X., Sadiq, S., Fung, G.P.C. (2010). Active Duplicate Detection. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 5981. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12026-8_43

  • DOI: https://doi.org/10.1007/978-3-642-12026-8_43

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12025-1

  • Online ISBN: 978-3-642-12026-8

  • eBook Packages: Computer Science, Computer Science (R0)
