Skip to main content

Efficient Duplicate Detection on Cloud Using a New Signature Scheme

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6897))

Abstract

Duplicate detection has been well recognized as a crucial task to improve the quality of data. Related work on this problem mainly aims to propose efficient approaches over a single machine. However, with increasing volume of the data, the performance to identify duplicates is still far from satisfactory. Hence, we try to handle the problem of duplicate detection over MapReduce, a share-nothing paradigm. We argue the performance of utilizing MapReduce to detect duplicates mainly depends on the number of candidate record pairs. In this paper, we proposed a new signature scheme with new pruning strategy over MapReduce to minimize the number of candidate record pairs. Our experimental results over both real and synthetic datasets demonstrate that our proposed signature based method is efficient and scalable.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   99.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   129.00
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Arasu, A., Ré, C., Suciu, D.: Large-scale deduplication with constraints using dedupalog. In: ICDE, pp. 952–963 (2009)

    Google Scholar 

  2. Fellegi, I., Sunter, A.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)

    Article  MATH  Google Scholar 

  3. Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: Current practice and future directions. CSIRO Mathematical and Information Sciences Technical Report 3, 83 (2003)

    Google Scholar 

  4. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)

    Google Scholar 

  5. Wang, J., Feng, J., Li, G.: Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1-2), 1219–1230 (2010)

    Google Scholar 

  6. Zhang, Z., Hadjieleftheriou, M., Ooi, B., Srivastava, D.: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD, pp. 915–926 (2010)

    Google Scholar 

  7. Xiao, C., Wang, W., Lin, X., Yu, J.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008)

    Google Scholar 

  8. Vernica, R., Carey, M., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD, pp. 495–506 (2010)

    Google Scholar 

  9. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)

    Google Scholar 

  10. White, T.: Hadoop: The Definitive Guide. Yahoo Press (2010)

    Google Scholar 

  11. Hernández, M., Stolfo, S.: The merge/purge problem for large databases. In: SIGMOD, p. 138 (1995)

    Google Scholar 

  12. Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96 (2005)

    Google Scholar 

  13. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 1–16 (2007)

    Google Scholar 

  14. Naumann, F., Herschel, M.: An Introduction to Duplicate Detection. Synthesis Lectures on Data Management 2(1), 1–87 (2010)

    Article  MATH  Google Scholar 

  15. Hernández, M., Stolfo, S.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery 2(1), 9–37 (1998)

    Article  Google Scholar 

  16. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web. Computer Networks and ISDN Systems 29(8-13), 1157–1166 (1997)

    Article  Google Scholar 

  17. Henzinger, M.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR, pp. 284–291 (2006)

    Google Scholar 

  18. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J., Yang, C.: Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering 13(1), 64–78 (2002)

    Article  Google Scholar 

  19. Kim, H., Lee, D.: HARRA: fast iterative hashed record linkage for large-scale data collections. In: EDBT, pp. 525–536 (2010)

    Google Scholar 

  20. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rong, C., Lu, W., Du, X., Zhang, X. (2011). Efficient Duplicate Detection on Cloud Using a New Signature Scheme. In: Wang, H., Li, S., Oyama, S., Hu, X., Qian, T. (eds) Web-Age Information Management. WAIM 2011. Lecture Notes in Computer Science, vol 6897. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23535-1_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-23535-1_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23534-4

  • Online ISBN: 978-3-642-23535-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics